Postmortem for April 17, 2023 Materialization and CSV Upload incident - Sigma AWS & Sigma GCP

Summary

On April 17, 2023, some Sigma users experienced errors and failures in their materialization runs. The cause was an increase in materialization usage that overwhelmed an inefficient query. A fix was released on April 18. At around 02:00 UTC on April 19, a second, more widespread incident occurred: users experienced errors and failures in writeback operations such as materializations and CSV uploads. This second incident was caused by a bug introduced by a change we had applied internally. A fix was released later that day.

Incident start time: Approximately 13:00 UTC April 17, 2023

Incident end time: Approximately 17:30 UTC April 19, 2023

Root Cause

The initial incident was caused by our garbage collection query, which was inefficient and did not scale. When materialization usage increased, this query locked the entire table and took a long time to run, so user operations waiting on that table failed with MySQL "Lock wait timeout exceeded" errors.

The second incident began when a new index was added to speed up another query. The new index unintentionally slowed the garbage collection query even further, because MySQL chose an index merge across the old and new indexes when serving that query. This caused the original failures to recur, along with reports of failures in CSV uploads, and had an even greater impact on users.
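
A plan regression like this one can be caught by inspecting EXPLAIN output before an index ships. The following is a minimal sketch in Python, not the check Sigma actually uses. It assumes the EXPLAIN rows have already been fetched as dictionaries keyed by MySQL's standard column names (table, type, key, Extra). The table and index names in the sample are hypothetical.

```python
def index_merge_steps(explain_rows):
    """Return a summary of every EXPLAIN row whose access type is index_merge."""
    flagged = []
    for row in explain_rows:
        access_type = (row.get("type") or "").lower()
        if access_type == "index_merge":
            # For index_merge, MySQL lists the merged indexes in the key column,
            # separated by commas.
            keys = [k.strip() for k in (row.get("key") or "").split(",") if k.strip()]
            flagged.append({
                "table": row.get("table"),
                "indexes": keys,
                "extra": row.get("Extra"),
            })
    return flagged


# Hypothetical plan row, shaped like MySQL EXPLAIN output for an index merge.
sample_plan = [
    {
        "table": "materializations",
        "type": "index_merge",
        "key": "idx_old,idx_new",
        "Extra": "Using intersect(idx_old,idx_new); Using where",
    }
]

for step in index_merge_steps(sample_plan):
    print(step["table"], "merges indexes:", ", ".join(step["indexes"]))
```

Running a check of this kind against the garbage collection query after adding an index would show whether the optimizer has switched to merging indexes.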

Mitigations and Fixes

For the initial incident, we identified the slow query and submitted a fix that denormalizes the table used for materializations to improve query performance. For the second incident, we reviewed the new index that caused the outage and dropped it as an immediate mitigation.

Timeline for fixes:

[Done] Denormalize table to improve query performance
[Done] Drop bad index
[Done] Prevent the garbage collection query from locking the whole table by separating selection and update
[Q2 2023] Set timeout for MySQL queries
[Q2 2023] Develop a clean-up process to remove expired materializations from tables
[Q2 2023] Investigate whether the table scan can be replaced with individual updates when a materialization expires
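
The third item above, separating selection from update, can be sketched as follows. This is an illustration of the general pattern, not Sigma's actual implementation. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs on its own, and SQLite's locking differs from InnoDB's, so only the structure carries over. The table name, column names, and batch size are all assumptions.

```python
import sqlite3


def expire_in_batches(conn, now, batch_size=100):
    """Mark expired rows in small batches and return how many were updated."""
    total = 0
    while True:
        # Step 1: a plain read that collects a bounded set of primary keys.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM materializations "
                "WHERE expires_at < ? AND expired = 0 "
                "ORDER BY id LIMIT ?",
                (now, batch_size),
            )
        ]
        if not ids:
            return total
        # Step 2: update only those rows by primary key in a short transaction,
        # so the write does not have to scan the table.
        placeholders = ",".join("?" * len(ids))
        with conn:
            cur = conn.execute(
                f"UPDATE materializations SET expired = 1 WHERE id IN ({placeholders})",
                ids,
            )
        total += cur.rowcount


# Hypothetical schema and data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE materializations ("
    "id INTEGER PRIMARY KEY, expires_at INTEGER, expired INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO materializations (expires_at) VALUES (?)",
    [(t,) for t in range(10)],
)
conn.commit()

updated = expire_in_batches(conn, now=5, batch_size=2)
print("rows marked expired:", updated)
```

Keeping each write small and keyed by primary key means each transaction touches only the rows selected for that batch, so other writers wait less between batches.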