Post Mortem for Sep 6, 2023: Workbooks cannot be opened for all Azure-hosted orgs

Summary

On September 6, 2023, at approximately 19:50 UTC, Sigma users hosted on Azure Cloud were unable to access Sigma Workbooks. Sigma Engineering determined that workbook access failed because Sigma could not reach the Azure managed HSM key vault, which stores the credentials used to access customers’ Cloud Data Warehouses and other customer resources.

Working with Azure Support and Engineering, we discovered that 1 of the 3 Virtual Machines (VMs) used for the managed key vault had failed and was preventing automatic recovery of the cluster, rendering the key vault non-operational. Azure Engineering manually restarted the suspect VM, which restored normal operation of the vault and resolved the issue.

Incident Start Time: Approximately 19:50 UTC Sep 6, 2023
Incident End Time: Approximately 21:45 UTC Sep 7, 2023

Root Cause

During maintenance operations on the Azure HSM Key Vault used by Sigma, the Key Vault failed in a way that required Microsoft Engineering to intervene and repair a VM that was in a bad state. Until that repair was made, existing workbooks could not load for any Sigma user on Azure.

Timeline

2023-09-06 19:50 UTC: Sigma users hosted on Azure Cloud became unable to access existing workbooks. Engineering determined that the cause was a deleted DNS record and private HSM endpoint. Our Infrastructure team attempted maintenance operations to regain access to the key vault but encountered error messages indicating that no operations could be performed on it because a concurrent operation was in progress.

2023-09-06 21:26 UTC: Sigma filed a Support case with Azure because we were unable to diagnose the error with the information available to us. Our engineers continued debugging with Azure Support Engineers without success and, in parallel, began evaluating alternative solutions.

2023-09-07 21:40 UTC: Azure Support confirmed 1 of 3 VMs used for the key vault was in a bad state, causing the entire key vault to be non-operational.

2023-09-07 22:32 UTC: Azure restarted the suspect VM, restoring access to existing workbooks for impacted customers.

Future Corrective Actions

  1. We are enhancing our processes to store a key vault backup, which will allow faster recovery of corrupted key vault contents and provide an alternative path to resolution for this kind of incident.

  2. We are working on more exhaustive liveness checks and alerting for workbook functionality, so that Support and Engineering are notified if workbook access goes down.

  3. We are working with the Microsoft Azure team to understand why escalation to the Azure HSM Engineering team took so long.
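As an illustration of the second action only, a liveness check of this kind could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Sigma's actual monitoring: the `probe` and `alert` callables, the `LivenessMonitor` name, and the failure threshold of 3 are all hypothetical and do not come from the incident record.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LivenessMonitor:
    """Counts consecutive probe failures and alerts once a threshold is reached."""

    probe: Callable[[], bool]      # hypothetical: returns True if a test workbook loads
    alert: Callable[[str], None]   # hypothetical: notifies Support and Engineering
    failure_threshold: int = 3     # assumed value, chosen only for illustration
    consecutive_failures: int = 0
    alerted: bool = False

    def run_once(self) -> bool:
        """Run a single probe and return True if it succeeded."""
        try:
            ok = self.probe()
        except Exception:
            ok = False  # a probe that raises is treated as a failure

        if ok:
            # Recovery: reset the counter so a later outage alerts again.
            self.consecutive_failures = 0
            self.alerted = False
            return True

        self.consecutive_failures += 1
        # Alert only once per outage, when the threshold is first reached.
        if self.consecutive_failures >= self.failure_threshold and not self.alerted:
            self.alert(
                f"workbook probe failed {self.consecutive_failures} times in a row"
            )
            self.alerted = True
        return False
```

Requiring several consecutive failures before alerting is one common way to avoid paging on a single transient error. A real deployment would call `run_once` on a schedule and choose the threshold to match how quickly an outage needs to be detected.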

We deeply apologize for the disruption and inconvenience you experienced as a result of this incident. Your trust is of utmost importance to us, and we are committed to taking the necessary measures to prevent similar incidents in the future. If you have any questions or concerns, please reach out to our Support team.

Thank you for your understanding.
