A small set of free-tier tenants suffered downtime

Incident Report for Hasura

Resolved

Between 21:30 GMT (Sun) and 06:00 GMT (Mon), Hasura Cloud experienced an issue that caused downtime for approximately 5% of free-tier projects. One of our worker nodes hosting a portion of free-tier tenants failed an internal health check and was marked inactive. Unfortunately, the auto-healing process for that worker did not run as expected. Additionally, the monitoring set up to alert on health check failures missed this worker node. As a result, the projects hosted on that worker experienced over 8 hours of downtime. We acknowledge that this is an unacceptable length of downtime for customers depending on Hasura Cloud. Once we noticed the issue, we immediately reallocated the affected projects to another active worker node and restored functionality to all projects by 06:15 GMT.

As part of our remediation of this event, we are enhancing our monitoring setup, defining more rigorous playbooks, and making changes to our internal health check system to increase robustness. We've improved our proactive monitoring so that we are alerted to any issues affecting projects and can prevent downtime. We have also validated that our reallocation process works as expected when worker nodes are marked as inactive.
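
For readers who want a concrete picture of the kind of check described here, the following is a minimal sketch, not Hasura's actual implementation. It assumes a hypothetical `Worker` record and a hypothetical `find_stuck_workers` helper that flags any worker that has remained inactive longer than a grace period, which is the situation where auto-healing has not taken effect and an alert should fire. The names, the 5-minute grace period, and the sample data are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Worker:
    # Hypothetical record; the field names are illustrative assumptions.
    name: str
    active: bool
    inactive_since: Optional[datetime] = None


def find_stuck_workers(
    workers: List[Worker], now: datetime, grace: timedelta
) -> List[Worker]:
    """Return workers that have stayed inactive for longer than `grace`.

    A worker that is inactive but within the grace period is assumed to be
    mid-recovery and is not reported.
    """
    return [
        w
        for w in workers
        if not w.active
        and w.inactive_since is not None
        and now - w.inactive_since > grace
    ]


# Sample data (invented for illustration).
now = datetime(2021, 2, 1, 6, 0, tzinfo=timezone.utc)
workers = [
    Worker("worker-a", active=True),
    Worker(
        "worker-b",
        active=False,
        inactive_since=datetime(2021, 1, 31, 21, 30, tzinfo=timezone.utc),
    ),
    Worker("worker-c", active=False, inactive_since=now - timedelta(minutes=2)),
]

stuck = find_stuck_workers(workers, now, grace=timedelta(minutes=5))
print([w.name for w in stuck])
```

In this sketch the check looks at worker state directly rather than relying on a health check failure event being delivered, so a missed event alone would not hide a worker that has been inactive for hours.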

Posted Feb 01, 2021 - 06:30 UTC