Postmortem for August 24, 2023 - Errors in exports and existing workbooks on Sigma Azure

Summary

On August 24, 2023, at around 6 AM PT, customers who host their Sigma org in the Azure:Production:Eastus2 region had difficulty loading their workbooks and received the error “The specified blob does not exist.” Impacted customers were unable to load draft workbooks, view workbook snapshots, or access workbooks that had not been created or modified in the last 24 hours.

We traced the underlying cause of this issue to a misconfigured retention policy that unintentionally deleted customer workbook metadata. Fortunately, this data was recoverable, but only through a laborious process that extended the outage for impacted customers.

Incident Start Time: 2023-08-24 12:53 UTC
Incident End Time: 2023-08-25 01:30 UTC

Root cause

The incident originated from a new retention policy applied to a storage container on Azure. This policy was intended to manage data retention for customer CSV uploads, which are stored temporarily before being saved to the customer’s cloud data warehouse. However, the policy was mistakenly applied to a broader range of storage than intended, so it also deleted workbook metadata stored in the same container.

Note that the underlying customer data was never at risk, because it is stored only in the customer’s cloud data warehouse.
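
To illustrate how the scope of a retention rule changes what it deletes, the following minimal sketch simulates a 24-hour rule with and without a path prefix filter. The blob names, the uploads/ prefix, and the helper blobs_selected_for_deletion are hypothetical; they are not taken from Sigma's actual storage layout or from the Azure API.

```python
from datetime import datetime, timedelta, timezone


def blobs_selected_for_deletion(blobs, now, max_age, prefix=""):
    """Return the names of blobs a retention rule would delete.

    blobs:   mapping of blob name -> last-modified time (hypothetical inventory).
    prefix:  the rule's scope; an empty prefix matches every blob in the container.
    """
    return sorted(
        name
        for name, last_modified in blobs.items()
        if name.startswith(prefix) and now - last_modified > max_age
    )


now = datetime(2023, 8, 24, 12, 0, tzinfo=timezone.utc)

# Hypothetical container holding both temporary uploads and workbook metadata.
inventory = {
    "uploads/orders.csv": now - timedelta(hours=30),
    "uploads/fresh.csv": now - timedelta(hours=2),
    "workbooks/q2-report.json": now - timedelta(days=40),
    "workbooks/edited-today.json": now - timedelta(hours=3),
}
one_day = timedelta(hours=24)

# Rule applied to the whole container (no prefix filter).
container_wide = blobs_selected_for_deletion(inventory, now, one_day)
# Rule scoped to the uploads path only.
uploads_only = blobs_selected_for_deletion(inventory, now, one_day, prefix="uploads/")

print(container_wide)
print(uploads_only)
```

In this simulation the container-wide rule also selects the workbook that has been untouched for 40 days, while the workbook edited a few hours earlier survives. That matches the symptom described in the summary, where only workbooks created or modified within the last 24 hours remained accessible.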

Timeline

2023-08-23 21:00 UTC A 24-hour retention policy was applied to the top-level Azure container that stores workbooks along with customer uploads. This was intended to ensure that we do not store uploaded customer data beyond 24 hours.

2023-08-24 12:53 UTC Escalation was initiated, and Support was paged.

2023-08-24 17:00 UTC Sigma engineering removed the upload retention policy from the affected Azure Storage Accounts.

2023-08-24 21:00 UTC Sigma engineering finished writing and verifying a script to recover the workbook metadata, and began running it across impacted customers.

2023-08-25 01:20 UTC Recovery script completed for all affected customers.

2023-08-25 01:52 UTC Sigma support updated the status page to indicate that the issue had been resolved for all affected accounts.

Mitigations and current fixes

  1. Removal of incorrect retention policy: We removed the improper retention policy from all affected Azure storage accounts, preventing further data loss.

  2. Data restoration: We developed and executed a script to restore all deleted customer workbooks. Because no bulk undelete option was available, the script had to iterate over more than 57,000 objects to recover the workbooks, which took considerable time.

  3. Customer verification: We identified the affected customer organizations and used the script to validate that their workbooks had been restored successfully.
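
The actual recovery script is not described here. As a rough illustration of why per-object recovery is slow and how it can be made more tractable, the sketch below restores objects one at a time with bounded parallelism and retries, and reports anything that still failed so it can be verified afterwards. The function restore_all and the restore_one callable are hypothetical. If blob soft delete is enabled, restore_one could wrap BlobClient.undelete_blob() from the azure-storage-blob SDK, but whether the incident script worked that way is an assumption.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def restore_all(object_names, restore_one, max_workers=16,
                max_attempts=3, backoff_seconds=0.5):
    """Restore each object individually; return (restored, failed) name lists.

    restore_one is a hypothetical callable that restores a single object and
    raises an exception on failure.
    """
    def attempt(name):
        for attempt_number in range(1, max_attempts + 1):
            try:
                restore_one(name)
                return name, True
            except Exception:
                # Back off a little longer after each failed attempt.
                if attempt_number < max_attempts:
                    time.sleep(backoff_seconds * attempt_number)
        return name, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, object_names))

    restored = [name for name, ok in results if ok]
    failed = [name for name, ok in results if not ok]
    return restored, failed


# Demo against an in-memory stand-in for the storage account.
deleted = {f"workbooks/{i}.json" for i in range(100)}
lock = threading.Lock()
already_failed_once = set()


def fake_restore(name):
    with lock:
        # Simulate one transient failure for a single object.
        if name.endswith("/7.json") and name not in already_failed_once:
            already_failed_once.add(name)
            raise RuntimeError("transient error")
        deleted.discard(name)


restored, failed = restore_all(sorted(deleted), fake_restore, backoff_seconds=0)
print(len(restored), len(failed), len(deleted))
```

Returning the failed names separately, rather than stopping at the first error, keeps a long run moving and leaves a concrete list to re-run and check per customer.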

Future corrective actions

  1. Improve disaster recovery plan: We will augment our disaster recovery plan to incorporate recovery procedures for accidental deletion.

  2. Optimize for recoverability: We will ensure all storage services across all cloud platforms are configured optimally to facilitate recovery.

  3. Enhance policy management: We will implement better processes for applying retention policies, ensuring they are scoped correctly to avoid accidental data loss.

  4. Separate storage: We will separate deletable and sensitive files into distinct containers or buckets, reducing the risk of inadvertent data removal.

  5. Enhance observability: We will implement improved monitoring and alerting mechanisms for cloud storage services, so we can detect and respond to anomalies more promptly.
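
As one possible shape for the observability item above, the sketch below flags an hour in which blob deletions far exceed the recent baseline. The function names, the hourly counts, and the threshold parameters are all hypothetical and are not Sigma's actual monitoring configuration; a real deployment would read these counts from the storage platform's metrics.

```python
from statistics import mean, pstdev


def deletion_threshold(history, sigma=3.0, floor=100):
    """Alert threshold: baseline mean plus sigma standard deviations.

    floor keeps the threshold from becoming too sensitive when the
    baseline is very quiet or when there is no history yet.
    """
    if not history:
        return float(floor)
    return max(float(floor), mean(history) + sigma * pstdev(history))


def is_deletion_spike(history, current, sigma=3.0, floor=100):
    """Return True when the current hourly delete count exceeds the threshold."""
    return current > deletion_threshold(history, sigma=sigma, floor=floor)


# Hypothetical hourly delete counts for a container during normal operation.
history = [40, 55, 38, 61, 47, 52]

print(is_deletion_spike(history, 60))    # an ordinary hour
print(is_deletion_spike(history, 5000))  # a mass deletion
```

A check like this would not prevent a mis-scoped policy from running, but it could shorten the gap between the first deletions and the first page.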

We deeply apologize for the disruption and inconvenience you experienced as a result of this incident. Your trust is of utmost importance to us, and we are committed to taking the necessary measures to prevent similar incidents in the future. If you have any questions or concerns, please reach out to our Support team.

Thank you for your understanding.