Postmortem for Aug 10th, 2023: Incorrect results for timezone-based calculations

Timeline for Affected Users on Snowflake
Incident Start Time: 2023-08-10 16:30 UTC
Incident End Time: 2023-08-10 18:49 UTC

Timeline for Affected Users on Databricks
Incident Start Time: 2023-08-14 16:00 UTC
Incident End Time: 2023-08-14 16:30 UTC

Summary

On August 10th, 2023, for a period of roughly two hours, Sigma users on Snowflake connections with a custom account timezone set saw incorrect results for timezone-based calculations.

On August 14th, 2023, for a period of 30 minutes, the same issue reemerged for users with a custom account timezone set on Databricks connections.

Root Cause

This incident was caused by a code change to one of Sigma's backend components. The change prevented timezones from being set properly in the Sigma drivers that connect to the data warehouse, so timezones were not passed correctly between backend components.
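
To illustrate why a dropped timezone produces incorrect results, here is a minimal sketch. It is not Sigma's actual code: the function name, the fixed UTC-7 offset standing in for a custom account timezone, and the fallback to UTC are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical account timezone: a fixed UTC-7 offset stands in for a named zone.
ACCOUNT_TZ = timezone(timedelta(hours=-7))


def local_date(ts_utc, tz=None):
    """Return the calendar date of a UTC timestamp in tz.

    Falls back to UTC when no timezone is supplied, which is the assumed
    behavior when the account timezone is lost between components.
    """
    target = tz if tz is not None else timezone.utc
    return ts_utc.astimezone(target).date()


# An event shortly after midnight UTC is still "yesterday" in UTC-7.
event = datetime(2023, 8, 10, 2, 30, tzinfo=timezone.utc)

with_tz = local_date(event, ACCOUNT_TZ)  # timezone passed through correctly
without_tz = local_date(event)           # timezone dropped on the way

print(with_tz, without_tz)
```

The two calls return different calendar dates for the same event, so a date filter evaluated without the account timezone would include or exclude different rows than the user expects.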

Timeline
2023-07-20 at 19:41: Code change to backend components pushed to internal testing environment.
2023-07-21 at 17:51: Reports of incorrect filter queries on internal testing environment.

2023-07-21 at 17:51: Code change on internal testing environment is reverted. Alterations are made which appear to fix the issue.
2023-08-07 at 18:10: Code change with alterations is redeployed to internal testing environment.

2023-08-10 at 16:58: Code change to backend components is deployed to production.
2023-08-10 at 17:30: Internal test for backend component health fails in the case of Snowflake. Investigation into root cause begins.
2023-08-10 at 19:49: Code change on production is reverted for the Snowflake case. Issue subsides on Snowflake.
2023-08-14 at 15:59: Customer on Databricks reports issues with incorrect filter queries. Engineering identifies the root cause as the same as in the Snowflake case.
2023-08-14 at 16:30: Code change is reverted for the Databricks case. Issue subsides on Databricks.

Response

Investigation began at the first failed internal test. Once the root cause was discovered, the code change was reverted immediately.

Forward-Looking Preventative Measures

  • Multiple improvements planned for more robust timezone handling across each related backend component
  • Improvements to backend testing to include live, immediate detection for mishandling of timezones
  • Finalize fixes to original code change and redeploy once the above safety measures are in place
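
As a rough sketch of what live detection of timezone mishandling could look like, the check below compares the timezone a connection is expected to use with the one the warehouse session actually reports. The function name and the `fetch_session_tz` callable are hypothetical; how the session timezone is read from each warehouse is left out and would differ per driver.

```python
from typing import Callable, Optional


def timezone_mismatch(
    expected_tz: str,
    fetch_session_tz: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return a description of the mismatch, or None if the timezone is correct.

    fetch_session_tz is a hypothetical hook that asks the warehouse session
    which timezone it is using; it returns None if no timezone was set.
    """
    actual = fetch_session_tz()
    if actual != expected_tz:
        return f"expected session timezone {expected_tz!r}, got {actual!r}"
    return None


# Simulated sessions: one healthy, one where the timezone was dropped upstream.
healthy = timezone_mismatch("America/Los_Angeles", lambda: "America/Los_Angeles")
broken = timezone_mismatch("America/Los_Angeles", lambda: None)

print(healthy)
print(broken)
```

Run against each warehouse type right after a deploy, a check of this kind would flag a dropped timezone on every connection type, not only the one covered by an existing health test.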
