Postmortem for January 27, 2023 Big Query incident - Sigma AWS & Sigma GCP


On 01/27/2023, Sigma users hosted on BigQuery experienced a high number of “query timed out in queue” errors. This error had occurred previously at low rates, but during earlier investigations it was believed to be isolated to specific users’ query queue sizes being too low. Initial mitigation was to increase those users’ queue sizes. Subsequently we discovered that this mitigation actually masked the real root cause by prompting the offending backend service to restart, which had the inadvertent effect of temporarily alleviating the underlying issue. Deeper investigation revealed a previously unidentified deadlock condition within the service.

Incident start time: 3:00 UTC January 27, 2023
Incident end time: 18:40 UTC January 28, 2023

Root Cause

Further investigation revealed the root cause for this to be due to a previously unidentified deadlock condition within a codebase change released on 01/03/2023. If queries were canceled during a certain point in the queueing process, a deadlock would occur. This caused the queue location to not be released, eventually filling up the queue and disallowing other queries to proceed.

Mitigations and Fixes

Increasing the queue sizes for impacted users proved to be a temporary mitigation as the actual root cause was yet to be identified. Thereafter, we identified the deadlock condition within our service that caused the queuing bottlenecks and deployed a fix for that to resolve the issue permanently.

Timeline for Fixes:

[Done] Mitigation by increasing the queue sizes
[Done] Fix deadlock conditions within our services
[Done] Improve internal alerting when errors due to query timeouts from queue size occur
[Q1 2023] Improve end-to-end tracking of queries in our system to be able to detect deadlock conditions

1 Like