Postmortem for May 2nd, 2023 Creating, Sharing, Publishing Documents Incident - Sigma AWS & Sigma GCP


On May 2nd, 2023, Sigma organizations hosted on AWS US and GCP cloud regions were unable to create, publish, or share Sigma documents. Additionally, user invites sent during the incident failed silently. The incident occurred because database migration tasks caused a key table used by all Sigma documents to go into a locked state.

Incident Start Time: Approximately 13:52 UTC May 2, 2023
Incident End Time: Approximately 14:47 UTC May 2, 2023

Root Cause

A bug in our code caused our database migrations to lock a key table for an extended period. This impacted a number of user workflows, such as creating and publishing Workbooks, sharing documents, and creating folders.

Mitigations & Fixes

The issue was mitigated by halting the migration. The root cause has been addressed by a change to our migration code that ensures changes are applied efficiently.

Future Corrective Actions

  • Enhance our alerting system for faster identification and mitigation by reducing the latency between emergency notifications and our communication channels, so that updates go out quickly.
  • Refine our internal process for managing migrations and backfills to improve efficiency and accuracy.
  • Implement fast-abort mechanisms so that similar jobs can be stopped quickly if they cause problems.
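
To illustrate the kind of safeguard the last two actions describe, here is a minimal sketch of a backfill that runs in small, separately committed batches and checks an abort signal between them. It is not Sigma's migration code: the `documents` table, the `owner_id` column, the batch size, the time budget, and both function names are assumptions chosen for the example, and SQLite stands in for the production database.

```python
import sqlite3
import threading
import time


def make_demo_db(rows):
    """Create an in-memory database with a hypothetical `documents` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, owner_id INTEGER)")
    conn.executemany("INSERT INTO documents (owner_id) VALUES (?)", [(None,)] * rows)
    conn.commit()
    return conn


def backfill_in_batches(conn, abort, batch_size=100, max_batch_seconds=1.0):
    """Fill in missing owner_id values; return (rows_updated, completed).

    Each batch is its own transaction, so locks are held only for one batch
    at a time, and the job can stop cleanly between batches.
    """
    total = 0
    while True:
        if abort.is_set():
            # An operator or the time budget below asked the job to stop.
            return total, False
        started = time.monotonic()
        with conn:  # commits on success, rolls back on an exception
            cur = conn.execute(
                "UPDATE documents SET owner_id = 0 WHERE id IN ("
                " SELECT id FROM documents WHERE owner_id IS NULL LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            return total, True  # nothing left to backfill
        total += cur.rowcount
        if time.monotonic() - started > max_batch_seconds:
            # A slow batch suggests contention, so trip the abort signal.
            abort.set()


if __name__ == "__main__":
    demo = make_demo_db(250)
    print(backfill_in_batches(demo, threading.Event()))
```

Because the abort signal is a `threading.Event`, an operator-facing control or an alert handler could set it from another thread, and the job would stop at the next batch boundary with all completed batches already committed.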