Summary
On 12/02/2022, Sigma users hosted on GCP experienced sporadic errors on a portion of their queries for several hours. The errors were caused by an interruption between two of our internal services, triggered by an unusually large error message that a cloud data warehouse returned repeatedly.
Root Cause
A particular query task repeatedly received an unexpectedly large error message from a cloud data warehouse. An upstream service forwarded this error message in a header, which overflowed a buffer in a downstream service and interrupted its normal operation. Our internal monitoring and logging systems detected the interruption through a logged error, but the precise root cause remained opaque.
Mitigations and Fixes
Immediate fixes:
- Increasing the size of the buffer of the affected downstream service
- Capping the size of the error message sent in the header by the upstream service
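
As an illustration of the second fix, the sketch below shows one way an upstream service could cap an error message before placing it in a header. It is not our production code: the function name `cap_error_message`, the 1024-byte default, and the truncation marker are all assumptions chosen for the example. The cap is applied to the UTF-8 byte length, since buffer limits are measured in bytes rather than characters.

```python
# Hypothetical marker appended so readers can tell the message was cut short.
TRUNCATION_MARKER = "...[truncated]"


def cap_error_message(message: str, max_bytes: int = 1024) -> str:
    """Return `message` limited to at most `max_bytes` bytes of UTF-8.

    The 1024-byte default is an assumed value, not a real service limit.
    """
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message

    marker = TRUNCATION_MARKER.encode("utf-8")
    if max_bytes < len(marker):
        # Not enough room for the marker; just cut the message.
        return encoded[:max_bytes].decode("utf-8", errors="ignore")

    budget = max_bytes - len(marker)
    # errors="ignore" drops a multi-byte character that the slice cut in half,
    # so the result is always valid text and never exceeds the byte budget.
    head = encoded[:budget].decode("utf-8", errors="ignore")
    return head + TRUNCATION_MARKER
```

Truncating at the sender keeps the header small regardless of what the warehouse returns, while the marker signals that the full text should be looked up in the upstream service's own logs.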
Timeline for Fixes
- [Done] Mitigation by increasing the size of the buffer of the affected downstream service and capping the size of the error message sent in the header by the upstream service
- [Q4 2022] Clearer, more actionable internal alerting that explains why the downstream service is interrupted or unavailable, or why the upstream service is throwing persistent errors
- [Q4 2022] Enforce a maximum size for all error messages ingested by the downstream service
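
The last two items could be combined on the receiving side roughly as sketched below: the downstream service enforces its own size limit on ingest and logs a message that names the source and the sizes involved, so the cause is visible directly in the logs. The function name `ingest_error_message`, the logger name, and the 4096-byte limit are hypothetical and only serve the example.

```python
import logging

# Hypothetical logger name for the downstream service's ingest path.
logger = logging.getLogger("downstream.ingest")

# Assumed limit for illustration; the real value would depend on the buffer size.
MAX_ERROR_BYTES = 4096


def ingest_error_message(source: str, message: str,
                         max_bytes: int = MAX_ERROR_BYTES) -> str:
    """Accept an error message, truncating and logging if it is oversized."""
    raw = message.encode("utf-8")
    if len(raw) <= max_bytes:
        return message

    # Log the source and both sizes so the alert is actionable on its own.
    logger.warning(
        "oversized error message from %s: %d bytes exceeds limit of %d; truncating",
        source, len(raw), max_bytes,
    )
    # errors="ignore" discards a multi-byte character split by the cut.
    return raw[:max_bytes].decode("utf-8", errors="ignore")
```

Enforcing the limit on ingest as well as at the sender means the downstream service stays protected even if a new upstream code path forgets to cap its messages.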