Postmortem for July 28, 2023 - Databricks "execute statement request" error

Summary

Starting on July 28, 2023, some Sigma users on Databricks connections encountered the error “execute statement request error: HTTP Response code: 400” when loading workbooks. The errors occurred because warehouse connections entered a bad state after sitting idle for some time. The issue was mitigated first on a per-customer basis and later globally, and a fix that avoids this condition has since been deployed.

Incident Start Time: Approximately 18:00 UTC July 27, 2023
Incident End Time: Approximately 16:25 UTC July 31, 2023

Root Cause

Sigma Engineering is refactoring its connection management code to facilitate future feature development. This work was not expected to introduce any behavioral changes, but it inadvertently increased the chance of encountering the Databricks server-side idle timeout. The refactored code path was enabled for Databricks customers on July 27, and the following day one Databricks customer reported receiving the error “execute statement request error: HTTP Response code: 400” when loading workbooks. After investigating, engineering disabled the refactored code path for that customer only, because the issue was believed to be isolated to that customer’s environment. On July 31, additional reports of the problem were received, and engineering disabled the refactored code path for all Databricks customers.

The root cause of the errors was ultimately determined to be prolonged reuse of connections to Databricks: connections that sat idle long enough hit the Databricks server-side idle timeout and became unusable. Databricks Engineering confirmed that SQL Warehouse connections get “cleaned up” on the server side after about an hour of inactivity. There was no mechanism to detect this on the Sigma client side, and the errors returned to the client in this condition were generic “HTTP 400” messages rather than session timeout indications. As a result, Sigma retried queries on the same connection instead of attempting to establish a new one.

It had always been possible for connections to enter this state; however, the refactored connection management code path had the unanticipated side effect of allowing connections to persist for longer.
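
The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Sigma's or Databricks' actual code: FakeServer, retry_same_connection, and retry_with_reconnect are hypothetical names, and the server is modeled as simply returning a generic 400 for any session that has been idle longer than its limit.

```python
class FakeServer:
    """Hypothetical stand-in for a warehouse that drops idle sessions."""

    def __init__(self, idle_limit):
        self.idle_limit = idle_limit  # seconds a session may sit idle
        self.now = 0.0                # simulated clock, in seconds
        self.last_used = {}           # session id -> time of last statement
        self.next_id = 0

    def open_session(self):
        self.next_id += 1
        self.last_used[self.next_id] = self.now
        return self.next_id

    def execute(self, session_id):
        last = self.last_used.get(session_id)
        if last is None or self.now - last > self.idle_limit:
            # The session was cleaned up server side. The client only sees
            # a generic 400, with no hint that the session expired.
            self.last_used.pop(session_id, None)
            return 400
        self.last_used[session_id] = self.now
        return 200


def retry_same_connection(server, session_id, attempts=3):
    """Retry on the same session: every attempt fails once it has expired."""
    for _ in range(attempts):
        if server.execute(session_id) == 200:
            return True
    return False


def retry_with_reconnect(server, session_id, attempts=3):
    """Open a fresh session after a failure, then retry on the new session."""
    for _ in range(attempts):
        if server.execute(session_id) == 200:
            return True, session_id
        session_id = server.open_session()
    return False, session_id
```

In this model, retrying on an expired session can never succeed, while opening a new session recovers on the next attempt. The difficulty in the real incident was that the client could not tell an expired session apart from any other HTTP 400.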

Timeline

Timestamp (UTC)   Event
2023-07-27 18:00  Refactored connection code path enabled for Databricks customers
2023-07-28 14:34  One Databricks customer reports “execute statement request error”
2023-07-28 17:01  Refactored connection code path disabled for that customer
2023-07-31 15:03  Engineering receives reports of additional customers encountering the error
2023-07-31 16:23  Refactored connection code path disabled for all other Databricks customers
2023-08-02 13:46  Change deployed setting max idle timeout for Databricks connections

Mitigations & Current Fixes

To mitigate this issue, the refactored code path was turned off for all Databricks users. In addition, to prevent connections to Databricks from being reused for too long, a max idle timeout was set for Databricks connections. Sigma is also working with Databricks Engineering to improve the error messages returned for this condition. Better error messages will allow Sigma to respond automatically to this specific condition and will facilitate debugging of similar conditions in the future.
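
As an illustration of the max idle timeout described above, the following is a minimal sketch of a connection pool that discards any connection that has sat idle longer than a configured limit. It is not Sigma's implementation: IdleLimitedPool, its connect callback, and the 1800-second value in the usage note are hypothetical, and the only grounded fact is that the limit needs to be shorter than the roughly one hour server-side cleanup window.

```python
import time


class IdleLimitedPool:
    """Hypothetical pool that never hands out a connection idle too long."""

    def __init__(self, connect, max_idle_seconds, clock=time.monotonic):
        self._connect = connect            # callable that opens a new connection
        self._max_idle = max_idle_seconds  # must be below the server-side timeout
        self._clock = clock                # injectable clock, eases testing
        self._idle = []                    # (connection, time it was released)

    def acquire(self):
        # Reuse an idle connection only if it is still within the limit.
        while self._idle:
            conn, released_at = self._idle.pop()
            if self._clock() - released_at < self._max_idle:
                return conn
            # Too old: close it if possible and fall through to the next one.
            close = getattr(conn, "close", None)
            if callable(close):
                close()
        return self._connect()

    def release(self, conn):
        self._idle.append((conn, self._clock()))
```

For example, a pool created with max_idle_seconds=1800 would reuse a connection released ten minutes ago but open a new one for a connection released an hour ago, so a statement is never sent on a session the server may already have cleaned up.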

Future Corrective Actions

  • Follow up with Databricks to provide more descriptive error messages for errors emitted from their backend - DONE
  • Fix API call error alerts and use structured Databricks errors
  • Log Databricks operation IDs/connection IDs to facilitate future debugging and improve connection calls in connection management
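
The last two corrective actions could take roughly the following shape. This is a sketch under loud assumptions: StatementFailure, its fields, and the "SESSION_EXPIRED" code are hypothetical placeholders, not the actual structured errors Databricks returns. The idea is only that a structured error code lets the client decide to reconnect, and that logging operation and connection IDs ties each failure to a specific session.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("warehouse")


@dataclass
class StatementFailure:
    """Hypothetical structured view of a failed statement."""
    http_status: int
    error_code: str      # assumed structured code from the backend
    operation_id: str
    connection_id: str


# Hypothetical set of codes that mean "this session is gone, open a new one".
SESSION_EXPIRED_CODES = {"SESSION_EXPIRED"}


def should_reconnect(failure):
    """Log the identifiers, then decide whether a fresh connection is needed."""
    logger.warning(
        "statement failed: status=%s code=%s operation_id=%s connection_id=%s",
        failure.http_status,
        failure.error_code,
        failure.operation_id,
        failure.connection_id,
    )
    return failure.error_code in SESSION_EXPIRED_CODES
```

With a check like this, a generic HTTP 400 that carries no recognizable code would still be retried or surfaced as before, while an explicit session-expired code would trigger a new connection.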
