Postmortem for January 17, 2023 incident - Sigma AWS

Summary

On 01/17/2023, Sigma users hosted on AWS experienced a high rate of errors when performing regular operations for approximately two hours. This was due to a memory overload in one of our backend services that handles SQL query generation. Boosting the service’s available memory and isolating the subset of queries causing the overload restored capacity and mitigated the incident.

Incident start time: 19:45 UTC January 17, 2023
Incident end time: 21:25 UTC January 17, 2023

Root Cause

A small subset of uniquely complex queries overloaded the memory capacity of our backend service that handles SQL query generation. This caused unrelated queries to experience high queueing times and eventually time out. Our internal monitoring and logging systems detected the service interruption, but it took additional time to isolate the precise queries that continued to overload the service even after its available memory had been increased.
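
The failure mode can be pictured with a minimal sketch, assuming the query-generation service gates incoming queries on a shared memory budget. The budget, timeout, and class names below are hypothetical, not Sigma's actual implementation.

import threading
import time

MEMORY_BUDGET_MB = 4096   # hypothetical memory budget for the shared service
QUEUE_TIMEOUT_S = 30      # hypothetical per-query queueing timeout

class QueryGenerationService:
    """Toy model: queries wait for free memory before SQL generation starts."""

    def __init__(self, budget_mb: int = MEMORY_BUDGET_MB):
        self.available_mb = budget_mb
        self.cond = threading.Condition()

    def generate(self, estimated_mb: int, work_s: float) -> bool:
        deadline = time.monotonic() + QUEUE_TIMEOUT_S
        with self.cond:
            # Unrelated queries queue here while heavy queries hold the budget.
            while self.available_mb < estimated_mb:
                remaining = deadline - time.monotonic()
                if remaining <= 0 or not self.cond.wait(timeout=remaining):
                    return False  # timed out waiting in the queue
            self.available_mb -= estimated_mb
        try:
            time.sleep(work_s)    # stand-in for the actual SQL generation work
            return True
        finally:
            with self.cond:
                self.available_mb += estimated_mb
                self.cond.notify_all()

In this toy model, a handful of queries with large estimated footprints hold most of the budget, so later, unrelated queries exhaust their queueing timeout before memory frees up, which is the pattern described above.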

Mitigations and Fixes

Immediate mitigations:

Increasing the available memory for the affected service by 50%.
Identifying the subset of queries triggering the memory overload and, as a stop-gap mitigation, isolating them to temporarily allocated, dedicated resources (see the sketch after this list).
Closely monitoring our services’ resource consumption.
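
The isolation stop-gap can be thought of as routing on an estimated memory footprint, sketched below under assumptions: the threshold, the estimator heuristic, and the pool sizes are hypothetical, and Sigma's actual routing logic is not public.

import concurrent.futures

# Hypothetical threshold: queries estimated above this go to the isolated pool.
HEAVY_QUERY_THRESHOLD_MB = 1024

# Separate executors stand in for the shared service and the temporarily
# allocated dedicated resources used as a stop-gap during the incident.
shared_pool = concurrent.futures.ThreadPoolExecutor(max_workers=32)
isolated_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def estimate_memory_mb(query_plan: dict) -> int:
    # Placeholder heuristic based on plan shape; not Sigma's actual estimator.
    return query_plan.get("join_count", 0) * 128 + query_plan.get("cte_count", 0) * 64

def submit_query(query_plan: dict, generate_sql):
    """Route memory-heavy query generation away from the shared pool."""
    pool = (
        isolated_pool
        if estimate_memory_mb(query_plan) > HEAVY_QUERY_THRESHOLD_MB
        else shared_pool
    )
    return pool.submit(generate_sql, query_plan)

Routing on an estimate rather than on observed usage is a deliberate trade-off: it keeps the shared service responsive even if a few light queries are occasionally misclassified into the dedicated pool.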

Timeline for fixes:

[Done] Complete isolation capability for the subset of memory-intensive queries.
[Q1 2023] Investigate the offending queries and identify short-term tactical optimizations that reduce memory load and prevent overutilization.
[Q2 2023] Continue the long-term project, already underway, to rearchitect our backend services so they handle resource-intensive queries more efficiently.