Hasura Cloud AWS us-east-1: Increased error rates and latency
Postmortem

Incident Postmortem: Traffic Surge Impact on AWS us-east-1 Region

Duration: November 26, 2024, 9:00 AM - 11:40 AM Pacific Time
Impact: Service degradation and increased error rates for projects in AWS us-east-1 region

Incident Summary

On November 26, our AWS us-east-1 region experienced three significant traffic spikes, processing a total of 43.68M requests. This surge caused elevated latency and error rates across projects in the region, with some projects having up to 90% of their requests dropped during peak periods.

Timeline of Events

  1. First Spike (9:05 AM - 9:25 AM PT)
* 14.69M requests
* Peak of 40,000 requests/second at approximately 9:10 AM
  2. Second Spike (9:40 AM - 9:50 AM PT)
* 12.6M requests
* Peak of 50,000 requests/second at approximately 9:46 AM
  3. Third Spike (11:22 AM - 11:37 AM PT)
* 12.41M requests
* Peak of 30,000 requests/second at approximately 11:32 AM

Root Cause

Our gateway clusters, which are shared across all projects in a region, were unable to scale quickly enough to handle the sudden flood of requests. While our systems did attempt to autoscale in response to the first spike, the scaling speed was insufficient for the volume of incoming traffic. Although Cloudflare's DDoS protection was active, the traffic patterns allowed a significant number of requests to pass through before protection thresholds were reached.
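To make the scaling gap concrete, here is a rough, illustrative calculation. The peak rate is taken from the incident data above; the baseline capacity, per-replica throughput, and scale-up delay are assumed values for illustration only, not actual platform figures:

```python
# Back-of-envelope sketch of why reactive autoscaling lags a sharp spike.
# The peak rate comes from the incident data; baseline capacity, per-replica
# throughput, and scale-up delay are ASSUMED figures, not platform numbers.

PEAK_RPS = 40_000                # observed peak of the first spike (requests/second)
BASELINE_CAPACITY_RPS = 15_000   # assumed steady-state gateway capacity
PER_REPLICA_RPS = 1_000          # assumed throughput of one gateway replica
SCALE_UP_DELAY_S = 180           # assumed time to detect load, schedule, and warm new replicas

excess_rps = PEAK_RPS - BASELINE_CAPACITY_RPS
replicas_needed = -(-excess_rps // PER_REPLICA_RPS)       # ceiling division
requests_above_capacity = excess_rps * SCALE_UP_DELAY_S

print(f"Extra replicas needed at peak: {replicas_needed}")
print(f"Requests above capacity during scale-up window: ~{requests_above_capacity:,}")
# With these assumed numbers, roughly 4.5M requests arrive above capacity before
# new replicas serve traffic, which is why faster triggers and pre-scaled headroom
# matter more than reactive scaling alone.
```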

Resolution Steps

Our on-call engineering team responded immediately upon detection at 9:10 AM PT. Given that the traffic consisted of legitimate GraphQL requests from multiple projects and IP addresses, we initially focused on manually adding capacity rather than blocking traffic. The additional capacity helped minimize the impact of subsequent spikes, though some service degradation still occurred.

Preventive Measures

We are implementing the following improvements to prevent similar incidents:

  1. Enhanced Autoscaling Policies (see the scaling-trigger sketch after this list)
* Updated scaling triggers to better respond to connection counts
* Switched from average CPU to maximum CPU utilization metrics
  2. Rate Limiting Improvements (see the rate-limit sketch after this list)
* Established stricter platform-wide rate limits
* Created a process for legitimate high-volume projects to request rate limit exclusions
  3. Enterprise Infrastructure Updates
* Setting up dedicated gateway clusters on top of dedicated Hasura clusters for enterprise customers
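As a rough illustration of the autoscaling change, the sketch below contrasts an average-CPU trigger with one that reacts to the hottest replica and to connection counts. The thresholds and sample metrics are assumptions for illustration only, not Hasura Cloud's actual policy values:

```python
# Minimal sketch of a scale-up decision that reacts to the hottest replica
# (maximum CPU) and to connection counts, rather than fleet-average CPU.
# Thresholds and sample metrics are ASSUMED values for illustration.

def should_scale_up(cpu_per_replica, conns_per_replica,
                    max_cpu_threshold=0.75, conn_threshold=5_000):
    """Scale up if any replica is hot or heavily connected."""
    return (max(cpu_per_replica) >= max_cpu_threshold
            or max(conns_per_replica) >= conn_threshold)

# During a spike, a few gateway replicas saturate long before the fleet
# average moves, so an average-based policy fires too late.
cpu = [0.95, 0.92, 0.30, 0.28, 0.25, 0.24]
conns = [6_200, 5_800, 900, 850, 800, 780]

avg_cpu_policy_fires = sum(cpu) / len(cpu) >= 0.75   # False: average is ~0.49
max_cpu_policy_fires = should_scale_up(cpu, conns)   # True: hottest replica at 0.95

print(avg_cpu_policy_fires, max_cpu_policy_fires)
```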
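Similarly, here is a minimal sketch of how a platform-wide rate limit with a per-project exclusion list could work, using a token bucket. The limit values and project identifiers are hypothetical and do not reflect the actual limits being rolled out:

```python
import time

# Sketch of a per-project token-bucket limiter with an exclusion list.
# Rates and project IDs are HYPOTHETICAL; this is not Hasura's implementation.

PLATFORM_RPS_LIMIT = 500                           # assumed default requests/second per project
EXCLUDED_PROJECTS = {"proj-high-volume-example"}   # vetted high-volume projects

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.capacity = rate, burst
        self.tokens, self.updated = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def admit(project_id):
    if project_id in EXCLUDED_PROJECTS:
        return True                # vetted projects bypass the platform limit
    if project_id not in buckets:
        buckets[project_id] = TokenBucket(PLATFORM_RPS_LIMIT, PLATFORM_RPS_LIMIT)
    return buckets[project_id].allow()

print(admit("proj-high-volume-example"))   # True: excluded project bypasses the limit
print(admit("proj-regular"))               # True while tokens remain, False once exhausted
```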

For Our Customers

If your project requires exemption from the new rate limits due to legitimate high-volume traffic, please contact our support team for traffic analysis and potential exclusion setup.

For our enterprise customers: We will be reaching out separately regarding the transition to dedicated gateway clusters.

Posted Dec 04, 2024 - 02:35 UTC

Resolved
A fix has been implemented and the incident has been resolved.
Posted Nov 26, 2024 - 19:40 UTC
Investigating
A DDoS attack targeting the AWS us-east-1 region is currently impacting projects hosted in this region. The Hasura team is investigating and working on a fix.
Posted Nov 26, 2024 - 17:00 UTC
This incident affected: Hasura Cloud AWS Regions (AWS: N. Virginia (us-east-1)).