Postmortem: May 1, 2023 Production Outage on All Clouds

Incident date: 2023-05-01
Time to resolution: 11 minutes


At approximately 12:12 PM PT on May 1, 2023, customers began reporting errors in the UI, including in the workbook interface. At the same time, we observed that many requests to the Sigma Application Server were failing. The issue impacted most Sigma customers.

Root Cause

The Sigma Engineering team investigated and determined that the issue was caused by a code release that was deployed before all of the necessary migrations had been run in supporting components.

A migration that the release required was not run before the release was deployed, which led to the request failures described above. This happened because the system designed to halt code releases when migrations are missing failed to function properly. Additionally, the migration workflow was not optimized, which may have contributed to the oversight.
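
To illustrate the kind of safeguard described above, the following is a minimal sketch of a pre-deploy check that blocks a release when a required migration has not been applied. It is not Sigma's actual release system. The names Release, required_migrations, missing_migrations, and check_release are hypothetical, and the sketch assumes each release declares the migration IDs it depends on and that the set of applied migrations can be read from the target environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical release descriptor; not an actual Sigma data structure."""
    version: str
    required_migrations: frozenset[str]  # assumed: migration IDs this release depends on


def missing_migrations(release: Release, applied: set[str]) -> list[str]:
    """Return required migration IDs that are not in the applied set, sorted."""
    return sorted(release.required_migrations - applied)


def check_release(release: Release, applied: set[str]) -> None:
    """Raise RuntimeError if any required migration is missing.

    A deploy pipeline would call this before the deploy step, so that a
    failure here stops the release instead of only logging a warning.
    """
    missing = missing_migrations(release, applied)
    if missing:
        raise RuntimeError(
            f"Blocking release {release.version}: missing migrations {missing}"
        )
```

One design choice worth noting in a check like this is that it fails closed: it raises instead of returning a status that the caller might ignore.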

Mitigations and Fixes

  • The Engineering team rolled back the deployment and reverted the code changes until the migrations could be completed.

Future Corrective Actions

  • New alerts have already been implemented to detect error spikes in our production database. These will help us identify, mitigate, and resolve issues faster.
  • Improvements are being made to the monitors that alert when there are missing migrations.
  • We will create a GitHub bot to increase awareness of the expectations and required actions for creating and running migrations.
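
As an illustration of the error-spike alerting mentioned in the first corrective action, the following is a minimal sketch of one way such a detector could work. It is not the alerting Sigma implemented. The class name ErrorSpikeDetector and the parameters window, factor, and min_errors are hypothetical, and the sketch assumes error counts are sampled at fixed intervals.

```python
from collections import deque


class ErrorSpikeDetector:
    """Hypothetical detector: flags an interval whose error count is far above the recent average."""

    def __init__(self, window: int = 5, factor: float = 3.0, min_errors: int = 10):
        self.history = deque(maxlen=window)  # error counts from the most recent intervals
        self.factor = factor                 # how many times the baseline counts as a spike
        self.min_errors = min_errors         # floor that suppresses alerts on low absolute counts

    def observe(self, error_count: int) -> bool:
        """Record one interval's error count and return True if it is a spike."""
        # Baseline is the mean of previous intervals, or 0.0 when there is no history yet.
        baseline = sum(self.history) / len(self.history) if self.history else 0.0
        is_spike = (
            error_count >= self.min_errors
            and error_count > self.factor * baseline
        )
        self.history.append(error_count)
        return is_spike
```

The min_errors floor is there so that a jump from one error to four does not page anyone. The values shown are placeholders and would need tuning against real traffic.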