Postmortem on service degradation for Sigma-AWS

Postmortem for high query latencies encountered by customers


Over the last one to two weeks, users on your Sigma instance may have reported very high query latencies and failures to load elements of their Workbooks. The first incident occurred on July 6th. We put mitigations in place at that time while we continued to investigate the root cause. On July 12th and July 13th, we again mitigated a smaller but related incident. These recurrences gave us enough detail to identify the root cause, and we now understand what needs to be done to stabilize our system. We expect to implement the immediate fixes for the stability issues today (July 13th).

We recognize that these incidents have caused significant problems for our users. Sigma values the trust that our customers place in us, and we apologize for the disruption.

Root Cause

We traced the cause to the part of our platform that routes requests to fetch query results. The problem surfaced under new, heavy workloads that accompanied recent growth in the number of queries our platform handles. Specifically, a very large number of very long queries could block our internal load balancer and cause delays across multiple workloads.
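
As a simplified illustration only (not Sigma's actual routing code), the effect described above can be sketched as a small simulation. It assumes a hypothetical routing tier with a fixed number of slots, assumes "long" means long-running, and uses made-up durations. The function name `simulate_waits` and all numbers are assumptions chosen for the example.

```python
import heapq


def simulate_waits(durations, slots):
    """Return how long each request waits for a free slot.

    Assumption: all requests arrive at time 0 and are served in order by
    a pool with `slots` slots (a hypothetical stand-in for a routing tier).
    """
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    waits = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        waits.append(start)             # arrival is at t=0, so wait == start
        heapq.heappush(free_at, start + duration)
    return waits


# Four long queries (60 s each) arrive just ahead of four short ones (1 s each).
long_queries = [60.0] * 4
short_queries = [1.0] * 4

# Shared pool: the long queries occupy every slot, so short queries queue behind them.
shared = simulate_waits(long_queries + short_queries, slots=4)
short_waits_shared = shared[len(long_queries):]

# Isolated pool: short queries get their own (smaller) pool and are unaffected.
short_waits_isolated = simulate_waits(short_queries, slots=2)

print("short-query waits, shared pool:  ", short_waits_shared)
print("short-query waits, isolated pool:", short_waits_isolated)
```

In this toy model, every short query in the shared pool waits for a long query to finish, while the same short queries in a separate pool wait at most one second. This is the intuition behind isolating heavy workloads.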

Mitigations and Fixes

Sigma engineering is treating this issue as a top priority.

  • In the short term, we are working to isolate these new, heavy workloads in our infrastructure. This will ensure that such workloads do not impact multiple customers.
  • In the long term, we have identified architectural improvements that change the manner in which we route these requests and fetch results. These improvements will ensure that the affected part of the platform scales to any workload.
  • We have also identified gaps in our alerting, which should have notified us of spikes in query latency. We will close these gaps to ensure such alerts are triggered with high fidelity.
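
For readers curious what a latency-spike alert can look like, here is a minimal sketch. It is not Sigma's alerting implementation. The class name `LatencySpikeDetector`, the window size, the minimum sample count, and the 5-second threshold are all hypothetical values chosen for illustration. It flags a spike when the 95th-percentile latency over a sliding window exceeds the threshold.

```python
from collections import deque
from statistics import quantiles


class LatencySpikeDetector:
    """Hypothetical sliding-window p95 latency check (illustrative only)."""

    def __init__(self, window=100, threshold_s=5.0, min_samples=20):
        self.samples = deque(maxlen=window)  # keeps only the most recent samples
        self.threshold_s = threshold_s
        self.min_samples = min_samples

    def observe(self, latency_s):
        """Record one query latency; return True if p95 exceeds the threshold."""
        self.samples.append(latency_s)
        if len(self.samples) < self.min_samples:
            return False  # too little data for a meaningful percentile
        # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
        p95 = quantiles(self.samples, n=20)[-1]
        return p95 > self.threshold_s


detector = LatencySpikeDetector()
for _ in range(50):
    detector.observe(0.2)          # normal traffic
one_outlier = detector.observe(30.0)  # a single slow query
sustained = [detector.observe(30.0) for _ in range(9)][-1]  # slow queries keep arriving
print("alert after one outlier:", one_outlier)
print("alert after sustained slowness:", sustained)
```

Using a percentile over a window, rather than any single slow query, is one common way to keep such an alert from firing on isolated outliers while still catching sustained degradation.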

Timeline for Fixes

We expect the short-term fixes to land today (July 13th). These will mitigate the issues that we have identified. In parallel, we have started work on the architectural improvements, which will be rolled out in the upcoming weeks.