Post Mortem on September 20th, 2022 incident - Sigma GCP

Summary

On 09/20/2022, Sigma users on GCP experienced errors loading Workbooks, as well as materialization and scheduled export failures. The incident resulted from a certificate rotation performed on 09/13 whose effects surfaced a week later, once the old certificate expired.

On 09/13/2022, we rotated a root certificate that is critical to internal communication between services inside Sigma. This was done in accordance with best practices to ensure the continued security of our system. In the process, we missed issuing new certificates to specific systems, causing a communication breakdown once the old certificate expired.

Root Cause

Our certificate rotation was incomplete: we missed rotating some certificates used by internal system components. These certificates are used for internal communication between Sigma’s services. At no time was any customer data at risk.

Our internal monitoring systems caught the issue immediately. We have also identified enhancements to our testing capabilities, and are actively working on them, so that we can validate similar future changes over time and in isolation.
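The gap described above, internal certificates that were not reissued under the new root, is the kind of thing an inventory audit can surface before the old certificate expires. The following is a minimal sketch of such a check, not a description of Sigma's tooling. It assumes a hypothetical inventory in which each internal certificate is recorded with the service that uses it, an identifier for the root that signed it, and its expiry time. The names CertRecord, find_unrotated, the service names, and the 14-day warning window are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CertRecord:
    """One entry in a hypothetical certificate inventory."""
    service: str         # illustrative service name, not a real Sigma component
    issuer_id: str       # identifier of the signing root (for example, a fingerprint)
    not_after: datetime  # expiry time of the certificate


def find_unrotated(records, new_root_id, now, warn_window=timedelta(days=14)):
    """Return (service, reason) pairs for certificates that need attention.

    A certificate is flagged if it was not signed by the new root, or if it
    was signed by the new root but expires within the warning window.
    """
    findings = []
    for rec in records:
        if rec.issuer_id != new_root_id:
            findings.append((rec.service, "signed by a root other than the new one"))
        elif rec.not_after - now <= warn_window:
            findings.append((rec.service, "expires within the warning window"))
    return findings


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    now = datetime(2022, 9, 13, tzinfo=timezone.utc)
    inventory = [
        CertRecord("service-a", "new-root", now + timedelta(days=365)),
        CertRecord("service-b", "old-root", now + timedelta(days=7)),
    ]
    for service, reason in find_unrotated(inventory, "new-root", now):
        print(f"{service}: {reason}")
```

Run as part of the rotation itself, a check like this would list every service still holding a certificate from the old root, rather than relying on failures to reveal them after expiry. How the inventory is populated (for example, by scanning deployed endpoints) is outside the scope of this sketch.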

Mitigations and Fixes

  • Immediate fixes:
    • Rotated the missed internal TLS certificates so that they are signed by the new root certificate
  • Mid-term fixes:
    • Create “escalate and page on error” integrations for our critical system alerts to ensure we acknowledge regressions much faster
    • Create a canary environment in which we can perform tests of similar changes in an isolated manner

Timeline for Fixes

  • [Done] Rotated the missed internal TLS certificates so that they are signed by the new CA
  • [Q4 2022] Create “escalate and page on error” integrations for our SLI probes so that we acknowledge these regressions much faster
  • [Q4 2022] Create an environment in which we can perform isolated tests of changes like this
