Infrastructure issue impacting the health of our VMs across all regions

Resolved Feb 7, 2025 9:18 PM MST

The issue causing customer instances to restart has been fully resolved across all regions.

We identified the root cause as a misconfigured logging infrastructure that was keeping stale files open. This caused some VMs to slowly fill up and create conditions that caused our orchestration system to proactively reschedule timescaledb workloads. This rescheduling caused affected customer databases not configured with HA replication to experience 5-10 minutes of downtime. Since we were able to resolve the problem before all nodes were affected, this was isolated to a small fraction of our customers.

We have now restored storage on all nodes and deployed a fixed version of the logging infrastructure to prevent recurrence.

Investigating Feb 7, 2025 7:01 PM MST

We are actively investigating an infrastructure issue impacting the health of our VMs across all regions. Customers with instances running on affected VMs may experience restarts, resulting in approximately 5–10 minutes of downtime. Our team is working diligently to identify and resolve the root cause to restore full stability across all impacted VMs.