@1RedOne
Created April 22, 2025 16:46
Recovering from an outage

Recovering from nuking production

If you've read about the time I destroyed Active Directory with a PowerShell script, you might wonder how I kept my job. That script saved me two minutes of work and took down all of prod, even locking people out of the parking deck and disabling internet access!

There are a few things I did to survive the experience.

  • Take my lumps and own up
  • Show up and participate in recovery
  • Document it and write the Post Incident Report
  • Learn from it

Own it

The moment I knew I had caused a problem, I told my manager. It was just a walk around the corner, but I walked in, let him know something important had just happened, and told him. I said "I was trying to optimize and standardize things to help us on-board new users, but I made a mistake," and I just laid it all out.

Yes, I was to blame, and yes, no one had asked me to optimize this particular thing.

Show up and Own up

I couldn't lead the recovery effort. I was too new and did not know about Authoritative Domain Restores, but I could show up, offer to go pick up food, and do my best to learn. Showing up on time, bringing donuts, and taking the lumps are all important. Even more important is to not be defensive and to embrace this as a learning opportunity.

Documentation and Learning

One thing we uncovered during the recovery was that our backup process was broken. If we'd had even a simple spreadsheet of group memberships, we could have recovered on our own without needing to call Microsoft.

So I led the charge on writing up the incident report, sharing it, and presenting it. I had to take lumps here as well, but it's a rite of passage.

Next, I used this moment to write a backup script that would document our Group memberships to help us recover if this ever took place again.
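
For illustration only (this is not the script from the incident), the idea of a "simple spreadsheet of group memberships" can be sketched in Python. The sketch assumes the memberships have already been pulled from the directory by some other means and handed over as a plain mapping of group name to member names. The function names and the sample groups are hypothetical.

```python
import csv
import io


def write_snapshot(memberships, out):
    """Write one (group, member) row per membership to a CSV stream."""
    writer = csv.writer(out)
    writer.writerow(["group", "member"])
    for group in sorted(memberships):
        for member in sorted(memberships[group]):
            writer.writerow([group, member])


def read_snapshot(src):
    """Rebuild the {group: set(members)} mapping from a CSV stream."""
    memberships = {}
    for row in csv.DictReader(src):
        memberships.setdefault(row["group"], set()).add(row["member"])
    return memberships


def missing_members(snapshot, current):
    """Return {group: members} that the snapshot has but `current` lacks."""
    missing = {}
    for group, members in snapshot.items():
        lost = members - current.get(group, set())
        if lost:
            missing[group] = lost
    return missing


# Hypothetical sample data standing in for a real directory query.
before = {"Admins": {"alice", "bob"}, "ParkingDeck": {"carol"}}
buffer = io.StringIO()
write_snapshot(before, buffer)

# After an incident, compare the saved snapshot with what is left.
buffer.seek(0)
after = {"Admins": {"alice"}}
to_restore = missing_members(read_snapshot(buffer), after)
```

Here `to_restore` lists the memberships that would need to be re-added. Note that a group with no members produces no rows, so this format cannot record empty groups. A real backup would write to a dated file on storage that does not depend on the directory being healthy.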

Learn from it

The most important thing? **Do not make the same mistake again.** Few people are fired for anything but the most shocking, over-the-top, avoidable errors.

In my experience, the decision to terminate someone hinges on whether they tried to hide their errors or refused to take ownership of them.
