Postmortem for February 17th Incident - Sigma GCP


On 2023-02-17, Sigma users on GCP saw queries fail for roughly one hour. The cause was a bug in our event logging process that prevented resources from being released for use by new queries. We mitigated the incident within the hour by reverting a recent code change that exposed the bug, and we fixed the bug itself shortly thereafter.

Incident start time: 05:45 UTC, February 17th, 2023
Incident end time: 06:37 UTC, February 17th, 2023

Root Cause

Our logging service writes event logs for each query. When an error occurred while writing a query’s logs, a bug in the logging service prevented the resource used for that write from being released for writing other queries’ logs. After enough errors, these resources became constrained, and subsequent queries waited indefinitely for a resource to become available in the logging service.
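
The failure mode described above can be illustrated with a minimal sketch. This is not the logging service's actual code: the class name, the pool size, and the use of a semaphore to stand in for the constrained resource are all assumptions made for illustration, and the short timeout exists only so the example does not hang the way the real queries did.

```python
import threading


class LeakyLogWriterPool:
    """Hypothetical pool of log-writer slots (illustrative only)."""

    def __init__(self, size):
        self._slots = threading.Semaphore(size)

    def write(self, query_id, fail=False, timeout=0.1):
        # In the incident, queries waited indefinitely at this point.
        # The timeout here only keeps the sketch from blocking forever.
        if not self._slots.acquire(timeout=timeout):
            return "starved"
        if fail:
            # Bug: the error path returns without releasing the slot.
            return "error"
        self._slots.release()
        return "ok"


pool = LeakyLogWriterPool(size=2)
results = [
    pool.write("q1"),             # succeeds and releases its slot
    pool.write("q2", fail=True),  # leaks one slot
    pool.write("q3", fail=True),  # leaks the last slot
    pool.write("q4"),             # healthy write, but no slot is left
]
print(results)  # ['ok', 'error', 'error', 'starved']
```

Each failed write permanently removes one slot from the pool, so a healthy query such as q4 is blocked even though nothing is wrong with it.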

Mitigation and Fixes

Immediate mitigation:
We reverted the code change that was inducing errors when writing to our event logging system, which mitigated the issue.

Root cause fix:
We fixed the bug in the logging system code that prevented resources from being released when a log write failed. Log writes now release their resources on error as expected.
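
One common way to guarantee release on error is to put the release in a `finally` block. The sketch below assumes that shape of fix. It is not the actual patch, and `LogWriterPool`, `LogWriteError`, and `failing_writer` are hypothetical names introduced for illustration.

```python
import threading


class LogWriteError(Exception):
    """Hypothetical error raised when a log write fails."""


class LogWriterPool:
    """Hypothetical pool of log-writer slots (illustrative only)."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    def write(self, query_id, writer, timeout=0.1):
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no log writer slot available")
        try:
            writer(query_id)
        finally:
            # Runs on success and on error, so a failed write
            # can no longer leak its slot.
            self._slots.release()


def failing_writer(query_id):
    raise LogWriteError(f"could not write log for {query_id}")


written = []
pool = LogWriterPool(size=1)

# Several failed writes in a row, against a pool with a single slot.
for query_id in ("q1", "q2", "q3"):
    try:
        pool.write(query_id, failing_writer)
    except LogWriteError:
        pass

# The slot is still available, so a later healthy write goes through.
pool.write("q4", written.append)
print(written)  # ['q4']
```

With the release in `finally`, the number of earlier write errors has no effect on whether later queries can obtain a slot.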