Elevated 500 errors with projects in AWS us-west-2
Incident Report for Hasura
Postmortem

A misconfiguration in our infrastructure for the AWS us-west-2 region resulted in a partial outage (5XX errors) starting at around 03:34 UTC on Jan 6th, 2023. Initially, only a portion of our customers in the us-west-2 region were affected. At around 06:00 UTC, the 5XX errors spread region-wide, and all customers running projects in us-west-2 were partially affected. At 06:15 UTC, this escalated to a full outage affecting all projects in the region.

The Hasura infrastructure team identified the misconfiguration and resolved the underlying issue. Systems were back to normal at 06:29 UTC.

The misconfiguration was traced back to our gateway instances, which could not receive updates to the load balancing configuration. A restart attempted as a fix caused the cached information on those instances to be lost, which resulted in a region-wide outage of about 30 minutes. The last 15 minutes of that period, leading up to resolution, were a full outage.
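
To illustrate the failure mode described above, the following is a minimal sketch of a "last known good" configuration cache that survives a process restart. It is not Hasura's gateway code and not the fix that was applied. It assumes the cached information is a load balancing configuration that can be serialized as JSON, and the names LastKnownGoodConfig, fetch, and the file path are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Optional


class LastKnownGoodConfig:
    """Keeps the last successfully fetched config in memory and on disk.

    Hypothetical sketch: `fetch` stands in for whatever call retrieves the
    load balancing configuration from its source of truth.
    """

    def __init__(self, path: str, fetch: Callable[[], dict]):
        self.path = path
        self.fetch = fetch
        self.current: Optional[dict] = None

    def _persist(self, config: dict) -> None:
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a truncated file behind.
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
        os.replace(tmp, self.path)

    def _load_persisted(self) -> Optional[dict]:
        try:
            with open(self.path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return None

    def refresh(self) -> dict:
        try:
            config = self.fetch()
        except Exception:
            # Updates are unavailable: keep serving the in-memory copy,
            # or, right after a restart, the copy persisted on disk.
            if self.current is None:
                self.current = self._load_persisted()
            if self.current is None:
                raise  # nothing cached anywhere, surface the failure
            return self.current
        self.current = config
        self._persist(config)
        return config
```

With a cache like this, a freshly restarted instance that still cannot reach its update source would fall back to the copy on disk rather than starting with no configuration at all. The trade-off is that the persisted copy may be stale, so a real deployment would likely also track its age and alert on it.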

The Hasura SRE team will review this incident in full detail and implement additional safeguards and fixes as follow-up items in response to our internal RCA.

Posted Jan 06, 2023 - 20:02 UTC

Resolved
This incident has been resolved.
Posted Jan 06, 2023 - 06:58 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 06, 2023 - 06:30 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 06, 2023 - 06:29 UTC
Investigating
We are currently investigating this issue.
Posted Jan 06, 2023 - 06:24 UTC
This incident affected: Hasura Cloud AWS Regions (AWS: Oregon (us-west-2)).