A misconfiguration in our infrastructure for AWS us-west-2 region resulted in a partial outage (5XX errors) starting at around 3:34 UTC on Jan 6th, 2023. At this time, only a portion of our customers in the us-west-2 region were affected. The 5XX errors elevated region wide at around 6:00 UTC. At this time, all of our customers running their projects in us-west-2 were partially affected. At 06:15 UTC, this escalated to affect all projects in the region.
Hasura infrastructure team identified the misconfiguration and resolved the underlying issue and systems were back to normal at 6:29 UTC.
The misconfiguration was traced back to our gateway instances, where it couldn’t get updates to changes in the load balancing configuration. A restart attempt to fix the issue resulted in the cached information to be lost and that resulted in a region-wide outage for about 30 minutes, out of which 15 minutes leading up to resolution was a full outage.
Hasura SRE team will be reviewing this incident in full detail and will implement additional safeguards and fixes as follow up items in response to our internal RCA.