Postmortem for Oct 03, 2023 Incident: Query failures for select organizations connected to Snowflake

Summary

Beginning around 13:00 UTC on October 3, 2023, many organizations that use a Snowflake warehouse were unable to issue queries and instead received the error message “Bad request; operation not supported”.

Because we had not released any changes and the errors were limited to Snowflake users, our engineering team worked with Snowflake’s team to investigate Snowflake’s latest release, version 7.35, which had been rolled out to several of our customers’ Snowflake accounts that morning.

After confirming an issue on its end, Snowflake rolled back to the previous version, which resolved the errors for all affected customers.

Incident Start Time: Approximately 13:00 UTC October 3, 2023

Incident End Time: Approximately 18:03 UTC October 3, 2023

Timeline

Timestamp (UTC) Event/Response
2023-10-03 11:56:00 First occurrence of “operation not supported” error
2023-10-03 12:55:00 A customer reported the issue and an escalation was initiated
2023-10-03 13:33:00 Sigma filed a support case with Snowflake
2023-10-03 14:20:00 Snowflake ticket escalated
2023-10-03 16:01:00 We reported to Snowflake that we were able to reproduce the issue internally and requested that our account be rolled back to 7.34
2023-10-03 16:31:00 Snowflake confirmed that Sigma’s internal account had been rolled back to 7.34
2023-10-03 16:35:00 We confirmed to Snowflake that the rollback fixed the errors and asked them to roll back the release globally
2023-10-03 17:49:00 Errors were resolved for the vast majority of our customers (except test accounts, which were rolled back about two hours later)

Root Cause

This incident affected all Sigma queries for customers whose connections through the Snowflake Go driver (the driver Sigma uses) were configured with legacy-format account identifiers in the connection string and the “account” connection parameter.

It was triggered by a Snowflake server rollout containing a code change that consolidated the account resolution logic in the login request. As a result, for Snowflake drivers (both Go and JDBC) configured with legacy-format account identifiers, account locators could no longer be looked up with the cloud region ID, and the login request failed.

Snowflake Support confirmed the issue in a follow-up message and indicated that a fix had been identified and would be released in version 7.35.1.
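
To make the distinction between identifier formats concrete, the following is a minimal sketch of how a connection's account value could be classified. It is only an illustration, not Sigma's or Snowflake's actual logic: the type `AccountFormat`, the function `ClassifyAccount`, and the sample identifiers are all hypothetical, and the rule it applies (a dot after the locator suggests the legacy locator-plus-region form, a hyphen without a dot suggests the organization-and-account form) is a rough heuristic that may misclassify unusual identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// AccountFormat is a hypothetical label for the shape of an account identifier.
type AccountFormat string

const (
	FormatOrgAccount    AccountFormat = "org-account"           // e.g. myorg-myaccount
	FormatLocatorRegion AccountFormat = "legacy-locator-region" // e.g. xy12345.us-east-1
	FormatLocatorOnly   AccountFormat = "legacy-locator"        // e.g. xy12345
)

// ClassifyAccount guesses the format of an account identifier.
// Assumption: this is a heuristic for illustration only.
func ClassifyAccount(account string) AccountFormat {
	a := strings.ToLower(strings.TrimSpace(account))
	// Tolerate a full hostname or a privatelink suffix being passed in.
	a = strings.TrimSuffix(a, ".snowflakecomputing.com")
	a = strings.TrimSuffix(a, ".privatelink")

	switch {
	case strings.Contains(a, "."):
		// A remaining dot implies locator.region or locator.region.cloud.
		return FormatLocatorRegion
	case strings.Contains(a, "-"):
		// No dot but a hyphen implies orgname-account_name.
		return FormatOrgAccount
	default:
		// A bare token is treated as an account locator with no region.
		return FormatLocatorOnly
	}
}

func main() {
	samples := []string{
		"xy12345.us-east-1",
		"xy12345.us-east-2.aws",
		"myorg-myaccount",
		"xy12345",
	}
	for _, s := range samples {
		fmt.Printf("%-24s %s\n", s, ClassifyAccount(s))
	}
}
```

A check along these lines could be run across stored connection settings to estimate which connections rely on the legacy locator-plus-region form described above.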

Scope of Impact

All customers with Snowflake connections whose drivers were configured with legacy-format account identifiers and whose accounts were upgraded to version 7.35 saw a 100% error rate for all Sigma-related warehouse operations.

Mitigations

The errors customers saw were resolved when Snowflake rolled their accounts back from version 7.35 to 7.34. Snowflake confirmed that a fix for the reported issue had been identified, and it was released in version 7.35.1.

Forward-looking Preventative Measures

  1. We’re working with Snowflake to configure internal testing accounts that always receive the latest server updates, ideally before any of our customers, so any failures can be acted upon immediately.
  2. We’re looking to invest in resources that will help surface Snowflake driver logs quickly in our environments, so that Snowflake Support can identify root causes faster.
  3. We have established a shared communication channel with engineers at Snowflake for quicker response times and more direct collaboration.

We apologize for the disruption and inconvenience this incident caused. Your trust in us is of the utmost importance, and we are dedicated to taking the necessary measures to prevent similar incidents from occurring in the future. If you have any questions, please reach out to our Support Team.

Thank you for your continuous support, patience and understanding.
