Postmortem for Sept 29, 2023 Incident: Missing search results for content from the last 3 months, affecting all Sigma deployments except AWS Canada

Summary

Beginning at approximately 20:44 UTC on September 29, 2023, Sigma users were unable to find workbooks created in the last 3 months via search, due to an inadvertent infrastructure change.

All customers on Azure, AWS (us-west-2 and eu-central-1), and GCP (us-central1) who used the file explorer in embeds, called the Sigma API for workbooks and files, or searched across Sigma were affected.

We restored the search database to its most recent state so that search results once again included the latest information.

Incident Start Time: Approximately 20:44 UTC September 29, 2023

Incident End Time: Approximately 23:17 UTC September 29, 2023

Timeline

Timestamp (UTC) Event/Response
2023-09-29 20:55:00 Incident alerted in our internal monitoring systems
2023-09-29 21:10:00 Incident noticed and flagged internally
2023-09-29 21:38:00 Azure eastus2 resolved on its own, owing to its small number of search updates
2023-09-29 21:46:00 Incident escalated internally
2023-09-29 21:47:00 First customer report
2023-09-29 22:30:00 Customer issue linked to incident
2023-09-29 22:31:00 Incident escalated to highest priority
2023-09-29 22:42:00 GCP search database manually restored to its pre-change state
2023-09-29 22:50:00 GCP us-central1 resolved
2023-09-29 23:00:00 AWS us-west-2 search database manually restored to its pre-change state
2023-09-29 23:07:00 AWS us-west-2 resolved
2023-09-29 23:11:00 AWS eu-central-1 search database manually restored to its pre-change state
2023-09-29 23:17:00 AWS eu-central-1 resolved
2023-09-29 23:17:00 Issue resolved

Root Cause

A routine infrastructure change accidentally replaced the current search database with a copy of its state from 3 months earlier. As a result, changes made during the past 3 months were not visible in search.

Applications built on embedded Sigma that used the workbooks API for navigation also failed to work properly if they depended on documents less than 3 months old.

Scope of Impact

All customers on Azure, AWS (us-west-2 and eu-central-1), and GCP (us-central1) who used the file explorer in embeds, called the Sigma API for workbooks and files, or searched across Sigma were affected.

Mitigations

Search results returned to the correct state after we rolled back each affected deployment to the previous healthy version of the search database.

Forward-looking Preventative Measures

  1. Refining our escalation procedures for incidents like this one to shorten time to resolution.
  2. Continuing to improve our internal alerting system.
  3. Adding guardrails to prevent the search database from being switched to an outdated state.
  4. Replaced our mechanism for updating the search database state with a new mechanism that does not run the same risk of being triggered accidentally.

We apologize for the disruption and inconvenience this incident caused. Your trust in us is of the utmost importance, and we are dedicated to taking the necessary measures to prevent similar incidents from occurring in the future. If you have any questions, please reach out to our Support Team.

Thank you for your continued support, patience, and understanding.