Postmortem for June 7, 2023 Infinite Loading Page on Sigma AWS US-West-2

Summary

On June 7, 2023, users in Sigma organizations hosted in the AWS US-West-2 region, across all connection types, were unable to interact with Sigma. They encountered an infinite loading page when logging in to the platform and when opening or editing workbooks. The incident was caused by a manual misconfiguration that made Sigma backend servers fail. The failure cascaded: critical API requests (api-aws.sigmacomputing.com) failed, resulting in indefinite wait times for customers hosted on the affected Sigma instances.

Incident Start Time: Approximately 19:19 UTC Jun 7, 2023
Incident End Time: Approximately 19:23 UTC Jun 7, 2023

Root Cause

During an ingress migration on the US-East-1 cluster for a disaster recovery (DR) region, a manual error deleted the ingress resource on the US-West-2 cluster instead of the intended US-East-1 cluster. The deletion cascaded to the associated AWS load balancer, which prevented requests to AWS Production (US) from resolving.
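
The failure mode above, a destructive operation run against the wrong cluster, is the kind of error a pre-flight check can catch. The following is a minimal sketch only, not Sigma's actual tooling: the context names, the `confirm_target_cluster` and `delete_ingress` helpers, and the `delete_fn` callback are all hypothetical and stand in for whatever mechanism performs the real deletion.

```python
class WrongClusterError(Exception):
    """Raised when the active cluster is not the intended target."""


def confirm_target_cluster(active_context: str, intended_context: str) -> None:
    """Refuse to continue unless the active context matches the intended one."""
    if active_context.strip().lower() != intended_context.strip().lower():
        raise WrongClusterError(
            f"Active context {active_context!r} does not match "
            f"intended target {intended_context!r}; aborting."
        )


def delete_ingress(active_context, intended_context, ingress_name, delete_fn):
    """Run a deletion only after the cluster check passes.

    delete_fn is a hypothetical callback that performs the real deletion.
    """
    confirm_target_cluster(active_context, intended_context)
    delete_fn(ingress_name)


if __name__ == "__main__":
    deleted = []
    # Hypothetical context names, for illustration only.
    delete_ingress("us-east-1-dr", "us-east-1-dr", "api-ingress", deleted.append)
    try:
        delete_ingress("us-west-2-prod", "us-east-1-dr", "api-ingress", deleted.append)
    except WrongClusterError as err:
        print(f"Blocked: {err}")
```

Requiring the operator to state the intended cluster explicitly, rather than relying on whichever context happens to be active, turns a silent mistake into a hard failure before anything is deleted.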

Response

The engineer identified the erroneous action on the incorrect cluster on June 7, 2023 at 19:12 UTC. The issue was mitigated quickly by recreating the ingress resource and load balancer at 19:15 UTC, and then updating the Cloudflare resource to point to the newly created ingress resource and load balancer by 19:22 UTC.

Forward-looking Preventative Measures

  • Enhance internal backend alerting, including adding a traffic-based monitor that alerts Support and Engineering if this condition recurs
  • Separate ingress resources from the load balancer to make erroneous deletions more difficult
  • Define an infrastructure runbook to govern manual operations against production clusters
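
As an illustration of the traffic-based monitor in the first measure, the sketch below flags a sudden drop in request volume or a spike in failed requests. It is a simplified example under stated assumptions, not the monitor Sigma deployed: the `TrafficSample` type, the `should_alert` function, and the threshold values are hypothetical, and a real monitor would read its samples from a metrics system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrafficSample:
    requests: int  # requests received during one interval
    errors: int    # requests that failed during that interval


def should_alert(
    samples: Sequence[TrafficSample],
    baseline_requests: float,
    min_traffic_ratio: float = 0.2,  # hypothetical threshold
    max_error_ratio: float = 0.5,    # hypothetical threshold
) -> bool:
    """Return True if recent traffic looks like an outage."""
    if not samples:
        # No data at all is treated as a signal in its own right.
        return True

    total_requests = sum(s.requests for s in samples)
    total_errors = sum(s.errors for s in samples)
    average_requests = total_requests / len(samples)

    # Traffic has fallen far below the expected per-interval volume.
    if average_requests < baseline_requests * min_traffic_ratio:
        return True

    # Most of the requests that do arrive are failing.
    if total_requests > 0 and total_errors / total_requests > max_error_ratio:
        return True

    return False


if __name__ == "__main__":
    healthy = [TrafficSample(1000, 5)] * 3
    dropped = [TrafficSample(10, 0)] * 3
    print(should_alert(healthy, baseline_requests=1000))
    print(should_alert(dropped, baseline_requests=1000))
```

Checking traffic volume as well as error rate matters for this kind of incident: when a load balancer disappears, requests may never reach the backend, so an error-rate alert computed on the backend alone could stay silent.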