Postmortem for December 2nd, 2022 incident - Sigma GCP

Summary

On 12/02/2022, Sigma users hosted on GCP experienced sporadic errors on a portion of their queries for several hours. The cause was an interruption between two of our internal services, triggered by an unusually large error message that a cloud data warehouse returned repeatedly.

Root Cause

A particular query task repeatedly received an unexpectedly large error message from a cloud data warehouse. An upstream service sent this error message inside a header, which overflowed a buffer in a downstream service and interrupted its normal operation. Our internal monitoring and logging systems detected the interruption through a logged error, but the precise root cause was not apparent from that error.

Mitigations and Fixes

Immediate fixes:

  • Increasing the size of the buffer of the affected downstream service
  • Capping the size of the error message sent in the header by the upstream service
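As an illustration of the second fix, here is a minimal sketch of capping an error message before it is placed in a header. It is not our actual implementation: the function name `cap_error_message`, the 1024-byte default, and the truncation marker are all assumptions made for the example. The cap is applied to the UTF-8 byte length rather than the character count, since buffers are sized in bytes.

```python
def cap_error_message(
    message: str,
    max_bytes: int = 1024,            # hypothetical limit, not the real value
    marker: str = "...[truncated]",   # hypothetical marker showing text was cut
) -> str:
    """Return `message` unchanged if it fits in `max_bytes` of UTF-8,
    otherwise a truncated copy that ends with `marker` and still fits."""
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message

    marker_bytes = marker.encode("utf-8")
    if max_bytes < len(marker_bytes):
        raise ValueError("max_bytes is too small to hold the truncation marker")

    budget = max_bytes - len(marker_bytes)
    # Cutting at a byte offset can split a multi-byte character;
    # errors="ignore" drops the incomplete trailing bytes.
    head = encoded[:budget].decode("utf-8", errors="ignore")
    return head + marker


if __name__ == "__main__":
    capped = cap_error_message("warehouse error: " + "x" * 5000, max_bytes=100)
    print(len(capped.encode("utf-8")), capped[-14:])
```

Keeping a visible marker at the end makes it clear to whoever reads the header that the original message was longer, so the full text can be looked up in logs.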

Timeline for Fixes

  • [Done] Mitigation by increasing the size of the buffer of the affected downstream service and capping the size of the error message sent in the header by the upstream service
  • [Q4 2022] Clearer, more actionable internal alerting that identifies why the downstream service is interrupted or unavailable, or why the upstream service is throwing persistent errors
  • [Q4 2022] Enforce a maximum size for all error messages ingested by the downstream service
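The last two items could work together along the following lines. This is only a sketch under stated assumptions: the names `check_header_sizes` and `OversizedHeaderError`, the logger name, and the 8192-byte limit are hypothetical and do not describe our services. The idea is that the receiving side rejects an oversized value explicitly and logs which header was too large and by how much, instead of failing with an opaque error.

```python
import logging
from typing import Mapping

logger = logging.getLogger("downstream")  # hypothetical logger name


class OversizedHeaderError(ValueError):
    """Raised when a header value exceeds the configured byte limit."""


def check_header_sizes(
    headers: Mapping[str, str],
    max_value_bytes: int = 8192,  # hypothetical limit, not the real value
) -> None:
    """Raise OversizedHeaderError for the first header value over the limit."""
    for name, value in headers.items():
        size = len(value.encode("utf-8"))
        if size > max_value_bytes:
            # Name the header and both sizes so an alert built on this
            # log line points directly at the cause.
            logger.error(
                "header %r is %d bytes, limit is %d", name, size, max_value_bytes
            )
            raise OversizedHeaderError(
                f"header {name!r} is {size} bytes; limit is {max_value_bytes}"
            )


if __name__ == "__main__":
    check_header_sizes({"x-error-message": "query failed"})  # passes silently
```

Rejecting early with a specific exception keeps one bad request from disturbing the rest of the service, and the log line gives alerting something concrete to match on.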