Postmortem for May 30, 2023 Incident - Connection Indexing Bug - Sigma AWS

Date 05/30/2023

Summary

Starting on 05/30/2023, Sigma users hosted on AWS experienced API timeouts and high query latency due to high DB utilization from connection indexing.

We noticed that all queries for organizations hosted on AWS were significantly slower. This degraded performance in Sigma, with some operations on the site taking several seconds.

Incident Start Time 05/30/2023 4:20 UTC
Incident End Time 06/02/2023 6:35 UTC

Root Cause

Further investigation found the culprit to be the connection indexing tasks. The queries associated with connection indexing saturated the DB's write throughput, which slowed all other queries.
Typically, these connection indexing tasks complete before business hours. However, the tasks were taking longer to complete, so their run times overlapped with customer queries from day-to-day business operations. The combined load saturated the DB and degraded performance through high query latency.

Mitigations and Fixes

  • Connection indexing was disabled during the day. This fixed slow or failed workbook loads, API call timeouts, and failing exports. The tradeoff was that newly created connections were slow to navigate, and existing connections needed their databases/schemas/tables manually synced whenever those changed in the warehouse.

  • The DB instance was upgraded, which improved throughput and latency.
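
As an illustration of the first mitigation, the gate that keeps connection indexing out of the daytime window could look roughly like the sketch below. This is not Sigma's actual implementation. The window boundaries (BUSINESS_START, BUSINESS_END) and the function names are hypothetical, since this postmortem does not state the real business-hours window.

```python
from datetime import datetime, time, timezone

# Hypothetical business-hours window in UTC (assumption, not from the postmortem).
# The end is earlier than the start, so the window wraps past midnight.
BUSINESS_START = time(13, 0)
BUSINESS_END = time(1, 0)


def in_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` is inside [start, end), including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def indexing_allowed(now: datetime) -> bool:
    """Allow connection indexing only outside the business-hours window.

    `now` must be a timezone-aware datetime; it is converted to UTC before the check.
    """
    utc_time = now.astimezone(timezone.utc).time()
    return not in_window(utc_time, BUSINESS_START, BUSINESS_END)
```

A scheduler would call indexing_allowed(datetime.now(timezone.utc)) before starting each indexing task and defer the task when it returns False.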

Timeline for Fixes

[Done] Upgraded DB Instance.
[Done] Improved monitoring for DB Utilization and trends.
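
One trend worth watching, given the root cause above, is whether indexing run times are growing toward the start of business hours. The sketch below is a minimal example of such a check, not the monitoring that was actually deployed. The function names, the per-run duration input, and the time budget are all assumptions made for illustration.

```python
import math
from statistics import mean
from typing import Optional


def duration_slope(durations_min: list[float]) -> float:
    """Least-squares slope of run duration (minutes per run) across consecutive runs."""
    n = len(durations_min)
    if n < 2:
        return 0.0
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(durations_min)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, durations_min))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def runs_until_overrun(durations_min: list[float], budget_min: float) -> Optional[int]:
    """Estimate how many more runs fit in the budget if the linear trend continues.

    Returns None when durations are flat or shrinking, and 0 when the latest
    run already meets or exceeds the budget.
    """
    slope = duration_slope(durations_min)
    if slope <= 0:
        return None
    latest = durations_min[-1]
    if latest >= budget_min:
        return 0
    return math.ceil((budget_min - latest) / slope)
```

For example, with recent runs of 100, 110, 120, and 130 minutes and a hypothetical 180-minute budget before business hours, runs_until_overrun returns 5, which could be used to raise an alert well before the overlap occurs.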