Summary
On September 6, 2023, at approximately 16:30 UTC, Sigma users hosted on Google Cloud Platform (GCP) experienced interruptions to materializations and CSV uploads. Sigma Engineering determined that prolonged query execution times were blocking processes from accessing writeback tables, which affected features such as materialization and CSV upload.
Further investigation revealed that an error in an internal maintenance task was placing unexpected load on our internal database systems, causing a general slowdown in processing. Sigma Engineering temporarily suspended the maintenance task and reverted the changes that caused the load, which resolved the issue.
Incident Start Time: Approximately 16:30 UTC Sep 6, 2023
Incident End Time: Approximately 19:40 UTC Sep 6, 2023
Root Cause
A query plan directive was added to improve the performance of a maintenance query on a different cloud deployment. On GCP, the directive unexpectedly had the opposite effect, producing a poor query plan and a long query runtime. Because the query held locks on tables used by materializations and CSV uploads, these features were impacted.
Timeline
- 09/05/2023 23:00 UTC: We made changes to the maintenance query and added a query plan directive.
- 09/06/2023 15:30 UTC: On GCP, the query plan directive had an adverse effect, resulting in longer query execution times and excessive lock wait times.
- 09/06/2023 18:30 UTC: We suspended the maintenance task, reverted changes including the query plan directive, and resumed the maintenance task.
Future Corrective Actions
- We are enhancing our processes for retrieval and examination of slow query logs and plans in production environments.
- We are creating actionable alerts and runbooks to reduce mean time to resolution in the event of future materialization errors.
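As a rough illustration of the kind of alerting described above, the following minimal sketch flags queries whose runtime or lock wait time exceeds a threshold. It is not Sigma's implementation: the `QuerySample` type, the `find_alerts` function, and both threshold values are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would be tuned per deployment.
MAX_RUNTIME_S = 300.0
MAX_LOCK_WAIT_S = 60.0


@dataclass
class QuerySample:
    name: str           # hypothetical label for the query or task
    runtime_s: float    # observed execution time, in seconds
    lock_wait_s: float  # time other sessions waited on its locks, in seconds


def find_alerts(samples, max_runtime_s=MAX_RUNTIME_S, max_lock_wait_s=MAX_LOCK_WAIT_S):
    """Return (name, reason) pairs for samples that exceed either threshold."""
    alerts = []
    for s in samples:
        if s.runtime_s > max_runtime_s:
            alerts.append((s.name, "long runtime"))
        if s.lock_wait_s > max_lock_wait_s:
            alerts.append((s.name, "excessive lock wait"))
    return alerts
```

Checking runtime and lock wait separately lets an alert say which symptom triggered it, which is the sort of detail a runbook can key off.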
We deeply apologize for the disruption and inconvenience you experienced as a result of this incident. Your trust is of utmost importance to us, and we are committed to taking the necessary measures to prevent similar incidents in the future. If you have any questions or concerns, please reach out to our Support team.
Thank you for your understanding.