Mastodon SLAs

Note

My previous post got some feedback from junior SWEs arguing -- roughly, and I'm paraphrasing -- that the costs were way off, and that this wasn't important anyway because last year Hachyderm had "99.996% uptime" while relying on a handful of volunteers.

This feels like a teachable moment, so here's an excerpt from the discussion that followed. I'm paraphrasing and removing names because the goal here is not to call out people earlier in their career; it's to highlight some of the things you need to think about when deciding if you want to run a service that people can rely on.


> 99.996% uptime or something

Right, but that's meaningless without knowing what the goal was, and without an agreement on what "downtime" means. If the goal was higher, then Hachyderm failed.

If the goal was significantly lower, then the Hachyderm infrastructure team could - perhaps - have afforded to make decisions that traded some of the reliability budget for something else. E.g., if you can do a task with great difficulty and no downtime, or do it much more easily with 5 minutes of downtime, taking the downtime might be the right decision. Context matters.

And it's still a largely meaningless number.

[...]

Yes, this is in the context of servers that are being run "as a business", which is shorthand for a lot of things: "Can reliably pay people for their labour", "Has professional moderation", "Will still be here in 12 months' time", "I can use this as a reliable social media presence for my charity", "Posts from my bot that posts air quality stats will federate quickly", and so on.

I had a whole discussion about this with the strangeobject.space server operators in late July last year, where they insisted everything was fine, that they could continue running it as a hobby, and that they were at no risk of burnout.

Less than two months later they abruptly realised they couldn't, and shut it down with very little notice. My half of the discussion is still visible; you can't see their half, because their service no longer exists...

Their shutdown notice is "Shutting Down" on the strangeobject.space blog.

> We don't have a "goal" — that was the number. (at least, I don't believe we have a "goal") 9/10 startups fail and all that

> We don't have an SLA, I'm not setting an SLA, but I am saying that our uptime was 99.996%, which is a huge difference to what you did set as the SLA in your numbers. Like by a factor of 10x. Was trying to share perspective of someone who actually does operations for a large mastodon server, rather than just picking numbers

We don't set SLAs because we like setting SLAs.

We set SLAs because we're trying to capture an idea of "How badly can our service perform and our users are still happy?"

Note

As an aside, what we're really talking about here are SLOs ("Objectives"). An SLA is the agreement you have with a customer about what happens if you break the SLO. But since we started with SLA as the term I'll stick with it.

SLAs are not the only tool in our toolbox that helps with this. And like any tool, used as intended they can provide useful insight, but used carelessly they can be very misleading.

The first problem with "our uptime was 99.996%" is that it doesn't explain what "uptime" is referring to.

Any moderately complex web application is a bunch of different services, API endpoints, and so on. Any one of them might fail, and the failure can have different effects on the service you're trying to provide.

Insight #1: An SLA should be provided per "thing the user is trying to do" ("user journey", if you will).

Sometimes there might be a 1:1 relationship between the thing and an API endpoint.

For example, boosting a post is a single action with a single API endpoint.

Sometimes there isn't, like the "Complete registering a client application and logging in" flow.
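To make that concrete, here's a rough sketch of what "one SLA per user journey" could look like as data. The journey names, endpoints, and targets below are illustrative assumptions, not anything Hachyderm or Mastodon actually publishes:

```typescript
// Illustrative only: the targets are made up, and a real definition would
// also need to say where and how each journey is measured.
interface JourneySlo {
  journey: string;            // the thing the user is trying to do
  endpoints: string[];        // a single endpoint, or every step in a flow
  availabilityTarget: number; // fraction of attempts that must succeed
  latencyTargetMs: number;    // what "fast enough" means for this journey
}

const journeySlos: JourneySlo[] = [
  {
    journey: "Boost a post", // 1:1 with a single API endpoint
    endpoints: ["/api/v1/statuses/:id/reblog"],
    availabilityTarget: 0.999,
    latencyTargetMs: 400,
  },
  {
    journey: "Register a client application and log in", // multi-step flow
    endpoints: ["/api/v1/apps", "/oauth/authorize", "/oauth/token"],
    availabilityTarget: 0.995,
    latencyTargetMs: 2000,
  },
];
```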

And sometimes the thing the user is trying to do is more nebulous than that.

For example, I hope we agree that users post because they want their followers to see the thing they've posted with low latency.

Some of that is outside any individual server's control. But you can control:

  1. Time to first attempt to federate a post
  2. Time between federation attempts if the remote server is unavailable
  3. Time until the post is inserted into the home timelines of followers local to your server

A service with high uptime but very slow time to federate is very different for users than a service with slightly lower uptime but much faster time to federate.
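As a sketch of how a server could watch those three delays, each one can be derived from timestamps the server already controls. The record shape and field names here are assumptions for illustration, not Mastodon's actual internals:

```typescript
// Hypothetical timestamps a server could record for one outgoing post.
interface FederationRecord {
  postCreatedAt: number;          // when the user hit "post" (ms since epoch)
  firstDeliveryAttemptAt: number; // first attempt to push to remote inboxes
  retryIntervalsMs: number[];     // gaps between retries to unavailable servers
  localTimelineInsertAt: number;  // when followers on this server got it
}

// One indicator per controllable delay from the list above.
function federationIndicators(r: FederationRecord) {
  return {
    timeToFirstAttemptMs: r.firstDeliveryAttemptAt - r.postCreatedAt,
    longestRetryGapMs: Math.max(0, ...r.retryIntervalsMs),
    timeToLocalTimelinesMs: r.localTimelineInsertAt - r.postCreatedAt,
  };
}
```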

Speaking of time to action, this leads us to:

Insight #2: Availability is not enough for users to be happy with a service, it must be fast too. Fast is subjective, of course, and the speed of different actions depends on what the user is trying to do (bookmark a post vs. upload a 10MB video).

Nevertheless, I hope you can agree that a service with high uptime and high latency is a very different (and worse) user experience than one with high uptime and low latency.
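One common way to fold speed into the measurement is to count a request as "good" only if it both succeeded and came back within the journey's latency target. A minimal sketch, with an arbitrary 300ms default that is an assumption, not a recommendation:

```typescript
interface RequestRecord {
  status: number;     // HTTP status returned to the user
  durationMs: number; // how long the request took
}

// Fraction of requests that were both successful and fast enough.
function goodRequestRatio(requests: RequestRecord[], latencyTargetMs = 300): number {
  if (requests.length === 0) return 1; // no traffic, nothing violated
  const good = requests.filter(
    (r) => r.status >= 200 && r.status < 300 && r.durationMs <= latencyTargetMs,
  ).length;
  return good / requests.length;
}
```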

Of course, different users, by nature of their location, ISP, internet infrastructure, and so on, may have very different online experiences.

Insight #3: You should be able to slice your SLA data by user. If most of your users have good service, but some of them are seeing poor (or terrible) service, then lumping all users together under a single number does not achieve the underlying goal - some of your users are unhappy but you don't know about it.
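For example, here is a sketch of slicing the same kind of request records by user, so the unhappy tail is visible instead of being averaged away (the record shape is an assumption, carried over from the sketch above with a userId added):

```typescript
interface UserRequestRecord {
  userId: string;
  status: number;
  durationMs: number;
}

// Success ratio per user: a single "99.996% overall" number can hide the
// handful of accounts for whom the service is effectively broken.
function perUserSuccessRatio(requests: UserRequestRecord[]): Map<string, number> {
  const counts = new Map<string, { good: number; total: number }>();
  for (const r of requests) {
    const c = counts.get(r.userId) ?? { good: 0, total: 0 };
    c.total += 1;
    if (r.status >= 200 && r.status < 300) c.good += 1;
    counts.set(r.userId, c);
  }
  const ratios = new Map<string, number>();
  counts.forEach((c, userId) => ratios.set(userId, c.good / c.total));
  return ratios;
}
```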

Of course, users' perception of the quality of a service can also change over time. Users tend to rate more recent service-impacting events as more serious than ones further in the past (with the caveat that the duration of the event also plays a significant part). So:

Insight #4: SLAs should be measured over a rolling time window, with more recent events weighing more heavily on the SLA than more distant events. The simplest way to do this is a rolling window where all events inside the window count equally and all events outside the window are ignored.
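That "simplest way" could look something like the following: keep a timestamp with every event and only count the ones inside the window. The 90-day window here is just an example figure:

```typescript
interface TimedRequestRecord {
  timestampMs: number; // when the request happened (ms since epoch)
  status: number;      // HTTP status returned to the user
}

const NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000;

// Success ratio over a rolling window: events inside the window count
// equally, events outside it are ignored entirely.
function rollingSuccessRatio(
  requests: TimedRequestRecord[],
  nowMs: number = Date.now(),
  windowMs: number = NINETY_DAYS_MS,
): number {
  const inWindow = requests.filter((r) => nowMs - r.timestampMs <= windowMs);
  if (inWindow.length === 0) return 1;
  const good = inWindow.filter((r) => r.status >= 200 && r.status < 300).length;
  return good / inWindow.length;
}
```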

In any complex distributed system there are parts that are under your control (e.g., the infrastructure you directly operate) and parts that are not (e.g., the user's ISP). Most users are not going to recognise the difference though, and their happiness -- remember, the thing we care about -- is dependent on the whole stack. This is unfair, but so is much of life.

Insight #5: SLAs should be measured as close to the user as possible (in a web app that means, if possible, from the browser). Wherever you're measuring from, though, it's important to be very clear about that in your SLA, as it helps highlight where the gaps in your knowledge are.
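In a web client that can be as simple as timing the real request in the browser and reporting the result somewhere the operator can aggregate it. A sketch, assuming a hypothetical /slo-report collection endpoint that the server would need to provide:

```typescript
// Runs in the browser: measure what the user actually experienced for one
// request, then report it. "/slo-report" is a hypothetical endpoint.
async function measuredFetch(url: string, init?: RequestInit): Promise<Response> {
  const start = performance.now();
  let status = 0; // 0 = the request never completed (DNS, network, etc.)
  try {
    const response = await fetch(url, init);
    status = response.status;
    return response;
  } finally {
    const durationMs = performance.now() - start;
    // sendBeacon doesn't block the UI and survives page unloads.
    navigator.sendBeacon(
      "/slo-report",
      JSON.stringify({ url, status, durationMs, measuredAt: Date.now() }),
    );
  }
}
```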

Your server monitoring may be telling you that everything is fine. But if your servers happen to be hosted at providers with patchy connectivity to the places where many of your users are, then those users are having a bad time and you don't know about it.

Back to "our uptime was 99.996%".

Fails test one -- it doesn't break this out by user action.

Fails test two -- it doesn't say anything about how fast the service is supposed to be.

Fails test three -- you can't break it down by user. Are all your users seeing this, or are some fraction of them having a significantly worse experience?

Fails test four -- there's no clear measuring period.

Fails test five -- it doesn't explain where it's measured from, or what the gaps in that measurement are.

That's not an SLA. It's not even a report against a single SLA. A single SLA might be:

Objective: The user should be able to reliably, and quickly, view the most recent version of their home timeline.

Indicator #1: Per-user, X% of the well-formed authenticated requests for /api/v1/timelines/home return HTTP 200, measured over a rolling 90 day window.

Indicator #2: Per-user, X% of the well-formed authenticated requests for /api/v1/timelines/home that returned HTTP 200 did so within 300ms (measured at the server, time from last byte of request to first byte of response, over a rolling 90 day window).
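As a sketch of how those two indicators could be computed for one user from request logs (the field names are assumptions; the 200 status, 300ms, and 90-day figures come straight from the indicators above):

```typescript
interface HomeTimelineRequest {
  timestampMs: number;      // when the request arrived
  status: number;           // HTTP status returned
  serverDurationMs: number; // last byte of request to first byte of response
}

const WINDOW_MS = 90 * 24 * 60 * 60 * 1000; // rolling 90 day window

// Evaluate Indicator #1 and Indicator #2 for one user's requests to
// /api/v1/timelines/home.
function homeTimelineIndicators(requests: HomeTimelineRequest[], nowMs = Date.now()) {
  const recent = requests.filter((r) => nowMs - r.timestampMs <= WINDOW_MS);
  const ok = recent.filter((r) => r.status === 200);
  const fast = ok.filter((r) => r.serverDurationMs <= 300);
  return {
    availability: recent.length ? ok.length / recent.length : 1, // Indicator #1
    latency: ok.length ? fast.length / ok.length : 1,            // Indicator #2
  };
}
```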

Is this a ton of work? Yes.

Do you need to do this for all services? Eventually, if you care about the happiness of your users, but you can iterate on it.

Do all servers need to do this? Only those that care about providing a service their users can depend on.

That last point is particularly important. There is absolutely space for federated servers that are not run like businesses, that do not have time for this sort of work, and whose users understand the quid pro quo they are entering into with the server operator. Caveat emptor, and all that.

But -- I think -- if the Fediverse is to be relevant long term, there need to be servers run by people who understand this sort of thing and are prepared to put the work in.

The horrific fires in LA demonstrate this; one of the laments has been how people can no longer rely on Twitter for timely, reliable information about the fires, current evacuation recommendations, and more.

Is there any server on the Fediverse right now where we could truthfully say that it could be relied on for this sort of critical information? I think the answer at the moment is "no", and in the future it should be "yes".

Does Hachyderm aspire to provide that service? I don't know, and there's no argument from me if Hachyderm explicitly declines to provide it; that's absolutely their prerogative.

But whether or not they do, "our uptime was 99.996%" has very little value when it comes to understanding the happiness of our users.
