@hcliff
Forked from juliandunn/post-mortem-template.md
Last active September 21, 2017 14:19
Post mortem template

2017-09-20 - Consumer app outage

Meeting waived: Henry Clifford

Incident Leader: Henry Clifford

Description

Beginning on 2017-09-20, the consumer API returned HTTP 500 (Internal Server Error) responses to most consumer requests.

Timeline

At 2pm GMT on September 19th we rolled out a major release to production and ran a data migration alongside it. QA had been performed on staging builds pointing at staging APIs, and the data migration had been trialed on staging.

At 2:15pm GMT on September 19th a team member informed the Incident Leader that the consumer app was not working. This was confirmed and a rollback to the last good version was performed, which appeared to resolve the issue.

At 6:04 GMT on September 21st a team member informed the eng team via Slack that the consumer app wasn't working. Damian discovered that the API was sending 500 codes. Damian and Josh maintained that master should work, then verified this by pointing the production iOS client at the consumer-staging API (which had the latest master).

At 6:42 GMT Henry updated consumer-api to master and Damian began an Elasticsearch rebuild.

At 6:49 GMT the app was confirmed to be working.

Contributing Factor(s)

When the data migration was run, some enums were deprecated. They no longer existed in Mongo, but Elasticsearch still contained them. Elasticsearch was not rebuilt post-migration as it should have been. This led Henry to assume the API was at fault and perform an API rollback.
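
A check along the following lines could surface this kind of mismatch right after a migration. It is a minimal sketch, not the team's tooling: the function name `find_stale_enum_values`, the `status` field, and the sample values are all hypothetical, and the documents are plain dicts standing in for whatever the search index actually returns.

```python
from collections import Counter
from typing import Iterable, Mapping


def find_stale_enum_values(
    indexed_docs: Iterable[Mapping[str, object]],
    field: str,
    allowed_values: set,
) -> Counter:
    """Count indexed values of `field` that are no longer in `allowed_values`."""
    stale = Counter()
    for doc in indexed_docs:
        value = doc.get(field)
        # Documents that lack the field are ignored; only retired values count.
        if value is not None and value not in allowed_values:
            stale[value] += 1
    return stale


# Hypothetical data: "PENDING" was retired by the migration but is still indexed.
allowed_after_migration = {"ACTIVE", "CLOSED"}
search_index_docs = [
    {"id": 1, "status": "ACTIVE"},
    {"id": 2, "status": "PENDING"},
    {"id": 3, "status": "PENDING"},
    {"id": 4},
]

stale = find_stale_enum_values(search_index_docs, "status", allowed_after_migration)
if stale:
    print(f"index needs a rebuild, stale values found: {dict(stale)}")
```

A non-empty result means the index still holds pre-migration values and points at a rebuild instead of an API rollback.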

Stabilization Steps

Rebuilt Elasticsearch to purge pre-migration data.

Impact

The consumer app was offline for a minimum of 40 minutes.

Corrective Actions

Performing QA ahead of time using the production app would have surfaced the missing Elasticsearch rebuild sooner. Migrations should be documented and discussed ahead of time to reduce the probability of missing steps.
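
One way to make a documented migration harder to run incompletely is to write its steps down as data and refuse to call it done while any remain. The sketch below illustrates that idea only; the `MigrationPlan` class and the step names are hypothetical and are not taken from the team's process.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationPlan:
    """A documented migration: an ordered list of steps that must all be done."""

    name: str
    required_steps: list
    completed: set = field(default_factory=set)

    def complete(self, step: str) -> None:
        # Reject steps that were never documented, so typos do not pass silently.
        if step not in self.required_steps:
            raise ValueError(f"unknown step: {step}")
        self.completed.add(step)

    def missing_steps(self) -> list:
        # Preserve the documented order when reporting what is left.
        return [s for s in self.required_steps if s not in self.completed]


# Hypothetical step names for a release like the one described above.
plan = MigrationPlan(
    name="enum deprecation",
    required_steps=[
        "run data migration",
        "rebuild search index",
        "QA with production app",
    ],
)
plan.complete("run data migration")

remaining = plan.missing_steps()
if remaining:
    print(f"migration '{plan.name}' is not finished: {remaining}")
```

Reviewing the `required_steps` list before the release gives the team a concrete artifact to discuss, and the remaining-steps check catches an omitted rebuild at release time.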
