This is a somewhat hastily written reply to https://hachyderm.io/@esk/113793277371908181 .
I disagree a bit with Esk's finger-in-the-air estimates for the number of staff required, their location, and their costs, and my response didn't really fit in a single post or thread. In particular, I think the estimate undercounts the number of people required to run the service, and it ignores moderation.
As you'll see when you read this, there are some big gaps in my knowledge around some of the best practices, particularly around moderation, so this is absolutely a draft. If you know more about these (or any of the other topics covered here) and can suggest improvements then please do.
So, you want to launch a professionally run Mastodon service. How many people do you need?
Very broadly this breaks down into three groups of people: technical staff for running the service, moderators for, well, moderating the service, and everyone else.
I keep seeing people doing this wrong, so let's do it right.
And the way you do it right is to start by considering the service you actually want to provide. Everything else flows from that.
In this case I think there's a bare minimum level of service you should commit to if you're going to charge people for it. Leaving issues, customer requests, and more unanswered for days on end doesn't cut it.
So let's say that a professionally run Mastodon service should aim for 99.9% availability.
That in itself is not very helpful. Specifically, you can't talk about availability without defining what service has to be available, what you even consider to be available, and so on.
Picking something very simple, for the purposes of this post I'll say the service is available if a successful response to a well-formed, authenticated GET request for an account's home timeline is sent within 300ms of the request being received.
Just to head off the inevitable bikeshedding: no, this doesn't consider what happens if a single account is 100% unavailable because of data corruption or other issues. It also doesn't consider the freshness of the data in that home timeline, or the latency between the user taking an action (e.g., sending a post) and the first attempt to federate that action out to the rest of the network. A more complete SLA would absolutely cover those issues. I'm not doing that here because, as we'll see, it doesn't really affect the staffing picture.
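To make that definition concrete, here's a minimal, purely illustrative sketch of a probe that measures availability per that definition. The instance URL and token are hypothetical, and I'm assuming the probe hits Mastodon's GET /api/v1/timelines/home endpoint; a real monitoring setup would run something like this from several locations and aggregate the results.

```python
# A minimal availability probe per the definition above. Assumptions (mine):
# the instance URL, the access token, and the use of Mastodon's
# GET /api/v1/timelines/home endpoint as "the home timeline".
import time
import requests

INSTANCE = "https://mastodon.example"  # hypothetical instance
ACCESS_TOKEN = "REPLACE_ME"            # token for a dedicated probe account

def probe_once(deadline_ms: float = 300.0) -> bool:
    """Return True if the home timeline responds successfully within the deadline."""
    start = time.monotonic()
    try:
        resp = requests.get(
            f"{INSTANCE}/api/v1/timelines/home",
            headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
            timeout=deadline_ms / 1000.0,  # give up once the deadline has passed
        )
    except requests.RequestException:
        return False  # timeouts and connection errors count as unavailable
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return resp.status_code == 200 and elapsed_ms <= deadline_ms
```

Availability over a window is then just the fraction of probes that return True.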
Given that definition, we probably want our service to be available 99.9% of the time (measured over some rolling window).
For clarity, that allows ~ 45 minutes of hard downtime per month, or ~ 9 hours per year.
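If you want to check those numbers (and the 99% figure that comes up later), the arithmetic is straightforward:

```python
# Downtime allowed by an availability target, using a 30-day month and a
# 365-day year. Matches the rough figures above.
def downtime_budget(availability: float) -> tuple[float, float]:
    """Return (minutes of downtime per month, hours of downtime per year)."""
    unavailable = 1.0 - availability
    minutes_per_month = 30 * 24 * 60 * unavailable
    hours_per_year = 365 * 24 * unavailable
    return minutes_per_month, hours_per_year

print(downtime_budget(0.999))  # ~43 min/month, ~8.8 h/year
print(downtime_budget(0.99))   # ~430 min/month, ~88 h (~3.65 days)/year
```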
That immediately implies you need 24x7 on-call coverage for the service. If you don't have that and an incident occurs during one of your on-call coverage gaps, you're going to blow past your downtime budget pretty quickly. Remember, don't set an SLA you're not staffed to provide.
24x7 on-call implies you have staff in at least two sites / timezones, each site on-call 12x7, with at least 6 people per site (see https://nikclayton.writeas.com/rules-for-a-healthy-on-call-rotation for why).
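As a rough sanity check on the head count and the on-call cadence (the "every ~5 weeks" figure reappears further down), here's the arithmetic; the "roughly one person out at any time" number is my own assumption:

```python
# Minimum on-call head count and rotation cadence, assuming two sites each
# covering 12x7 with one-week shifts and a minimum rotation of 6 people
# (per the linked on-call rules). The "one person out on average" figure
# for vacation/illness is an assumption, not a measurement.
SITES = 2
MIN_ROTATION_PER_SITE = 6
AVG_UNAVAILABLE_PER_SITE = 1

engineers = SITES * MIN_ROTATION_PER_SITE                # 12 people
nominal_gap_weeks = MIN_ROTATION_PER_SITE                # on call every 6 weeks on paper
effective_gap_weeks = MIN_ROTATION_PER_SITE - AVG_UNAVAILABLE_PER_SITE  # ~every 5 weeks in practice
print(engineers, nominal_gap_weeks, effective_gap_weeks)  # 12 6 5
```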
Like the tech side, you need an SLA for moderator coverage. I don't know what that might look like, so I'm going to guess that "Any account report sent to moderators will have initial triage started within 30 minutes of the report being received" is reasonable.
How long are moderation shifts? Probably not the 12h that tech on-call shifts are, in order to limit the mental toll on moderators. 4h (again, a finger-in-the-air estimate) feels about right, so 6 moderators per day (3 per site), plus another 2 per site to cover holidays and unexpected unavailability. So 10 moderators across 2 sites.
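The same back-of-the-envelope arithmetic for moderators; the 4h shift length and the two spare people per site are the finger-in-the-air estimates above, not established moderation practice:

```python
# Moderator head count for 24x7 coverage in 4-hour shifts split across two
# sites, plus extra cover per site for holidays and unexpected absence.
SHIFT_HOURS = 4
SITES = 2
COVER_PER_SITE = 2

shifts_per_day = 24 // SHIFT_HOURS               # 6 shifts to cover each day
on_rotation = shifts_per_day                     # i.e. 3 per site
total_moderators = on_rotation + SITES * COVER_PER_SITE
print(shifts_per_day, total_moderators)          # 6 10
```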
You also need to fill other roles (some people might cover multiple roles):
- HR
- Corp. infrastructure. Even if everyone is fully remote you need VPNs, auth, etc. This might overlap with the service infrastructure team
- Marketing
- Legal
- Accounts
- etc.
Now we're at 22 staff for service infrastructure and moderation (12 engineers and 10 moderators), plus some additional number of people for the other business roles.
There's probably not enough constant work for those 22 service and moderation employees, but you need them because of the bursty nature of the work and to maintain the SLAs.
So what else are they doing?
Moderators, probably not much. It would be a part-time role, and it needs to allow time off to recover. A moderator comes on shift, gets up to speed, moderates for N hours (per the above, N=4), and then goes off shift after handing over to the next person.
Maybe - if the moderation load is light - it's possible for a single moderation team to moderate multiple servers that have the same moderation policies.
But then... why have multiple servers? If they run the same software and have the same moderation policies they could be one server; the user experience is going to be much the same...
The service infra team, though, does have time to do projects. But what sort of projects?
It's tempting to think they can be sending PRs to improve the Mastodon software. I don't think that will work very well. Unless it's very carefully managed, a team of 12 sending PRs will rapidly overwhelm Mastodon GmbH.
There are definitely local infrastructure projects for the service, documenting best practices, and developing other services that integrate with the Mastodon software stack (moderation tooling, etc.) that would keep some of the team busy.
I do think there is something else to try -- work with Mastodon GmbH to second (some or all of) those people to Mastodon GmbH for, say, 50% of their project work time (roughly 5-7 days per month). Essentially, a much closer collaboration than throwing PRs over the wall and hoping some of them stick. That is not without its own problems though, as it increases the management work that Mastodon GmbH has to do to wrangle all this, and team members who are only present for ~25% of the month present their own challenges.
Could you get away with fewer people? Not without significant tradeoffs for employee quality of life. On the tech side a team of 6 being on-call translates into each person feeling the on-call burden every ~ 5 weeks (because of vacations, illnesses, etc). Being on-call more often than that is the route to burnout and to not having enough time to carry out project work.
You could reduce the number of moderators a bit with longer moderation shifts - 6 or 8 hours instead of 4. That might be OK; I don't know enough about moderation best practices to say whether that's actually a good idea or not.
Reducing the SLA doesn't meaningfully reduce the number of staff. Going from 99.9% to 99% allows for ~ 3.65 days of downtime per year. If you reduce staffing costs (e.g., by having no weekend or overnight cover) then two incidents occurring over a weekend can easily burn through that error budget. And saying "We have no weekend moderator coverage" is how you tell abusers to only abuse your users at the weekend.
So what does all this cost? Two assumptions to start with:
- Assume everyone is working remotely (no money needed for offices)
- Assume employee cost is 1.4 x salary (benefits, insurance, company-supplied hardware, reimbursement for internet connection, etc)
There's no reason to hire people in expensive places. For example, https://devjob.ro/en/salaries suggests Romanian tech salaries are around EUR 3K/month, and, handily, Brazilian SWE salaries are about the same (in USD, and EUR/USD is very roughly 1:1).
For the service infrastructure team that's 12 employees x 3K / month x 1.4 employee cost = **~ 50K / month**.
For moderators I have no idea what's reasonable. If we assume we pay a good moderator the same rate as a good SWE, but they're only working 33% of the time (4h shifts, remember), that's 10 moderators x 3K / month x 1.4 employee cost x 33% = ~ 14K / month.
For everything else it's a complete guess that it's half the SWE cost, so ~ 25K / month. Some of that is salaries, some of that might be payments to SaaS providers or similar for payroll, accounting, a legal retainer, marketing, and so on. I suspect that's an underestimate.
50K (SWE) + 14K (moderators) + 25K (other) = ~ 89K / month
And that's trying to keep costs down. Hiring elsewhere could bump that up significantly. E.g., https://germantechjobs.de/en/salaries suggests that a reasonable salary for a German SWE might be 62K / year, or about 5.2K / month.
6 SWEs in Brazil and 6 in Germany takes the SWE cost to (6 x 3 x 1.4) + (6 x 5.2 x 1.4) = ~ 69K / month.
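Pulling the cost model together in one place, as a sketch (salary figures and the 1.4x multiplier are the rough assumptions above, not quotes):

```python
# Rough monthly staff-cost model, in thousands of EUR. All inputs are the
# finger-in-the-air figures from the text above.
OVERHEAD = 1.4  # employee cost multiplier on top of salary

def monthly_cost(headcount: int, salary_k: float, fraction: float = 1.0) -> float:
    """Monthly cost for `headcount` people at `salary_k` K/month, working `fraction` time."""
    return headcount * salary_k * OVERHEAD * fraction

swe        = monthly_cost(12, 3.0)           # ~50K: service infrastructure team
moderators = monthly_cost(10, 3.0, 1 / 3)    # ~14K: part-time moderators
other      = swe / 2                         # ~25K: guess for everything else
print(swe + moderators + other)              # ~89-90K / month, depending on rounding

# Variant: 6 SWEs in Brazil and 6 in Germany
print(monthly_cost(6, 3.0) + monthly_cost(6, 5.2))  # ~69K / month for SWEs alone
```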
There are other challenges.
For example, the above assumes employees working from home on company-provided equipment in Romania and Brazil, potentially with access to customer data from around the world. That's going to be a fun legal issue to navigate, and you'll need auditing controls in place to make sure customer data isn't used inappropriately.