Postmortem for January 26, 2023 Snowflake incident - Sigma AWS & Sigma GCP


On 01/26/2023, Snowflake users hosted in the AWS - US West (Oregon) region experienced delays or errors when performing operations such as executing queries in Snowflake and Sigma and accessing the Snowflake UI. As a side effect, all Sigma users who were not directly impacted also experienced degraded performance and errors, specifically in the Usage tables in the Administration section. After the Snowflake incident was resolved, a small number of users continued to experience long query times and timeouts because of a queuing bottleneck whose root cause was internal to Sigma.

Incident start time: 18:20 UTC January 26, 2023
Incident end time: 23:40 UTC January 26, 2023

Root Cause

The primary impact, as well as the side-effect causing degraded performance to Sigma’s Usage tables, was due to a technical outage in Snowflake warehouses hosted in the AWS - US West (Oregon) region. Snowflake’s postmortem for this incident can be viewed here.

After Snowflake resolved their incident, a small number of users were left with a backlog of queued queries that could not run during the outage. These queues failed to clear, causing long query times and timeouts unrelated to the initial outage. Further investigation showed that the root cause was internal: a previously unidentified deadlock condition within one of our services. For more information about this root cause, see our postmortem for a related incident here.

Mitigations and Fixes

Immediate mitigations:

Snowflake’s resolution of their incident mitigated the majority of the impact. We then identified the deadlock condition within our service that caused the queuing bottlenecks and deployed a fix for it.

Timeline for fixes:

[Done] Snowflake to resolve their outage
[Done] Fix deadlock conditions within our services
[Done] Improve internal alerting when errors due to query timeouts from queue size occur
[Q1 2023] Improve end-to-end tracking of queries in our system to be able to detect deadlock conditions
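
For readers curious what the last two items could look like in practice, here is a minimal, hypothetical sketch. It is not Sigma's implementation; the class name QueryTracker, the stall_threshold_s parameter, and the method names are assumptions made purely for illustration. The idea is to record when each query enters a queue and to flag any query that has waited longer than a threshold, which is one common signal that a queue is failing to clear.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QueryTracker:
    """Hypothetical tracker: remembers when each query was enqueued and
    reports the ones that have been waiting longer than a threshold."""

    stall_threshold_s: float
    _enqueued_at: Dict[str, float] = field(default_factory=dict)

    def enqueue(self, query_id: str, now: Optional[float] = None) -> None:
        # Record the enqueue time; `now` can be injected for testing.
        self._enqueued_at[query_id] = time.monotonic() if now is None else now

    def complete(self, query_id: str) -> None:
        # A finished (or cancelled) query is no longer tracked.
        self._enqueued_at.pop(query_id, None)

    def stalled(self, now: Optional[float] = None) -> List[str]:
        # Return IDs of queries that have waited longer than the threshold.
        current = time.monotonic() if now is None else now
        return sorted(
            qid
            for qid, started in self._enqueued_at.items()
            if current - started > self.stall_threshold_s
        )
```

A stalled query is only a symptom: it may indicate a deadlock, a slow warehouse, or an upstream outage, so a check like this would feed an alert for a human to investigate rather than prove a deadlock on its own.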
