On April 20th, 2023, several Sigma organizations experienced degraded performance and sporadic failures of materializations and scheduled exports. A code change to our connection indexing system, which catalogs database objects for all organizations’ connections, introduced a bug that overloaded our task scheduling system and prevented scheduled materializations and scheduled exports from running. A fix for this issue was implemented at approximately 12:20am UTC on April 21st, 2023.
Incident Start Time: Approximately 11:50am UTC April 20th, 2023
Incident End Time: Approximately 12:20am UTC April 21st, 2023
A code change to our connection indexing logic contained a bug that caused indexing tasks to run indefinitely. The indexer shares infrastructure with scheduled workloads like materializations and scheduled exports, so the runaway indexing tasks created contention in the task runner. As a result, users experienced degraded performance and sporadic failures of materializations and scheduled exports.
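To illustrate the failure mode described above, the following is a minimal sketch of how a task that would otherwise run indefinitely can be bounded with a deadline. It is not Sigma's implementation: the names `run_with_deadline`, `runaway_index`, and `quick_index` are hypothetical, and the sketch assumes tasks are written as Python `asyncio` coroutines.

```python
import asyncio


async def run_with_deadline(task_fn, timeout_s):
    """Run an async task and cancel it if it exceeds timeout_s seconds.

    Returns the task's result, or None if the deadline was hit.
    """
    try:
        return await asyncio.wait_for(task_fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the task, so its worker slot is freed.
        return None


async def runaway_index():
    # Hypothetical stand-in for an indexing task that never finishes.
    while True:
        await asyncio.sleep(0.01)


async def quick_index():
    # Hypothetical stand-in for an indexing task that completes normally.
    await asyncio.sleep(0.01)
    return "indexed"
```

With a deadline in place, a runaway task is cancelled rather than holding a slot in the shared runner, while well-behaved tasks return their results as usual.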
Mitigations & Fixes
At 4:55am UTC on Thursday, April 20th, we identified timeout errors caused by connection indexing. To mitigate these errors, we halted connection indexing tasks, which may have slowed navigation and search within newly created connections. Our team then identified and fixed the bug in the connection indexing logic and re-enabled new connection indexing tasks. As a result, materializations and scheduled exports were able to run without issue.
Future Corrective Actions
- Prioritizing improvements to observability across our system, including our connection cataloging service, and conducting comprehensive reviews to identify areas that require further enhancement.
- Implementing additional tests to ensure similar issues do not impact our production environment.
- Improving the isolation of time-sensitive tasks like materializations and scheduled exports so that delays in other task processing do not affect them.
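The isolation described in the last item can be sketched as follows. This is an illustrative example only, assuming a thread-pool task runner. The pool names and the functions `stuck_indexing_task`, `run_materialization`, and `demo` are hypothetical and do not reflect Sigma's actual architecture.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical setup: each workload class gets its own worker pool, so a
# saturated indexing pool cannot take workers away from scheduled tasks.
indexing_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="indexing")
scheduled_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="scheduled")

release = threading.Event()


def stuck_indexing_task():
    # Simulates an indexing task that does not finish until released.
    release.wait()


def run_materialization(name):
    # Simulates a time-sensitive scheduled task.
    return f"{name}: done"


def demo():
    try:
        # Saturate every worker in the indexing pool.
        for _ in range(2):
            indexing_pool.submit(stuck_indexing_task)
        # The scheduled task still completes because it runs in its own pool.
        future = scheduled_pool.submit(run_materialization, "daily_export")
        return future.result(timeout=1)
    finally:
        # Release the simulated stuck tasks so worker threads can exit.
        release.set()
```

Had both workloads shared a single two-worker pool, the scheduled task would have waited behind the stuck indexing tasks. Separate pools trade some idle capacity for predictable latency on time-sensitive work.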