2017-09-20 - Consumer app outage

Meeting waived: Henry Clifford

Incident Leader: Henry Clifford

Description

Beginning on 2017-09-20, the consumer API returned HTTP 500 (Internal Server Error) responses to most consumer requests.

Timeline

At 2pm GMT September 19th we rolled out a major release to production and ran a data migration alongside it. QA had been performed on staging builds pointing at staging APIs, and the data migration had been trialed on staging.

At 2:15pm GMT September 19th a team member informed the incident leader that the consumer app was not working. This was confirmed and a rollback to the last good version was performed. This appeared to resolve the issue.

At 6:04 GMT September 21st a team member informed the eng team via Slack that the consumer app wasn't working. Damian discovered the API was sending 500 codes. Damian and Josh stated that master should work, then verified this by pointing the production iOS client at the consumer-staging API (which had the latest master).

At 6:42 GMT Henry updated consumer-api to master and Damian began an Elasticsearch rebuild.

At 6:49 GMT the app was confirmed to be working.

Contributing Factor(s)

When the data migration was run, some enums were deprecated. They no longer existed in mongo, but Elasticsearch still contained them. Elasticsearch was not rebuilt post-migration as it should have been. This led Henry to assume the API was at fault and perform an API rollback.
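A post-migration check along the following lines could flag this kind of mismatch before users hit it. This is a minimal sketch, not the team's actual tooling: the function name `find_stale_enum_docs`, the `status` field, and the sample values are all hypothetical, and the documents are plain dicts standing in for whatever the search index returns.

```python
from typing import Iterable, Mapping


def find_stale_enum_docs(
    indexed_docs: Iterable[Mapping[str, object]],
    field: str,
    allowed_values: set,
) -> list:
    """Return ids of indexed documents whose `field` holds a value
    that is no longer in `allowed_values` (e.g. a deprecated enum)."""
    stale = []
    for doc in indexed_docs:
        value = doc.get(field)
        # A missing field is not treated as stale; only a present but
        # unrecognised value is reported.
        if value is not None and value not in allowed_values:
            stale.append(doc["id"])
    return stale


# Hypothetical sample data: "legacy" stands in for an enum removed by the migration.
allowed = {"active", "archived"}
index_snapshot = [
    {"id": "a1", "status": "active"},
    {"id": "b2", "status": "legacy"},
    {"id": "c3"},
]
stale_ids = find_stale_enum_docs(index_snapshot, "status", allowed)
```

A non-empty `stale_ids` after a migration would suggest the index still holds pre-migration data and needs a rebuild.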

Stabilization Steps

Rebuilding elasticsearch to purge pre-migration data

Impact

Consumer app was offline for a minimum of 40 minutes

Corrective Actions

Performing QA ahead of time using the production app would have surfaced the missing Elasticsearch rebuild sooner. Migrations should be documented and discussed ahead of time to reduce the probability of missing steps.
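One lightweight way to document a migration ahead of time is to write the plan down as data and compare it against a list of required steps. The sketch below assumes hypothetical step names (`run_data_migration`, `rebuild_search_index`, `verify_production_client`); it illustrates the idea and does not describe an existing process.

```python
# Hypothetical step names; a real checklist would be agreed on by the team.
REQUIRED_STEPS = (
    "run_data_migration",
    "rebuild_search_index",
    "verify_production_client",
)


def missing_steps(plan, required=REQUIRED_STEPS):
    """Return the required steps that are absent from `plan`,
    in the order they appear in `required`."""
    planned = set(plan)
    return [step for step in required if step not in planned]


# Illustrative plan that omits the index rebuild and the production check.
example_plan = ["deploy_release", "run_data_migration"]
gaps = missing_steps(example_plan)
```

Reviewing `gaps` during the pre-migration discussion gives the team a concrete list to talk through rather than relying on memory.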

hcliff/post-mortem-template.md