Duration: November 26, 2024, 9:00 AM - 11:40 AM Pacific Time
Impact: Service degradation and increased error rates for projects in the AWS us-east-1 region
On November 26, our AWS us-east-1 region experienced three significant traffic spikes, processing a total of 43.68M requests. This surge caused elevated latency and error rates across projects in the region, with some projects experiencing up to 90% dropped traffic during peak periods.
* First spike: 14.69M requests, peaking at 40,000 requests/second at approximately 9:10 AM
* Second spike: 12.6M requests, peaking at 50,000 requests/second at approximately 9:46 AM
* Third spike: 12.41M requests, peaking at 30,000 requests/second at approximately 11:32 AM
Our gateway clusters, which are shared across all projects in a region, were unable to scale quickly enough to handle the sudden flood of requests. While our systems did attempt to autoscale in response to the first spike, the scaling speed was insufficient for the volume of incoming traffic. Although Cloudflare's DDoS protection was active, the traffic patterns allowed a significant number of requests to pass through before protection thresholds were reached.
Our on-call engineering team responded immediately upon detection at 9:10 AM PT. Because the traffic consisted of legitimate GraphQL requests from multiple projects and IP addresses, we initially focused on manually adding capacity rather than blocking traffic. The additional capacity helped reduce the impact of the subsequent spikes, though some service degradation still occurred.
We have made, or are in the process of making, the following improvements to prevent similar incidents:
* Updated scaling triggers to better respond to connection counts
* Switched from average CPU to maximum CPU utilization metrics
* Established stricter platform-wide rate limits
* Created a process for legitimate high-volume projects to request rate limit exclusions
* Setting up dedicated gateway clusters on top of dedicated Hasura clusters for enterprise customers
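To illustrate why the switch from average to maximum CPU utilization can matter, here is a minimal sketch. It is not our actual autoscaler configuration: the function name `should_scale_out`, the 80% threshold, and the per-node readings are all assumptions chosen for illustration. It shows how a single saturated gateway node can be hidden by an average but still trips a maximum-based trigger.

```python
from statistics import mean


def should_scale_out(cpu_by_node, threshold=0.80, use_max=True):
    """Return True when the chosen CPU aggregate reaches the threshold.

    cpu_by_node: per-node CPU utilization as fractions in [0, 1].
    threshold:   hypothetical scale-out threshold (assumed, not a real setting).
    use_max:     aggregate with max() instead of the arithmetic mean.
    """
    if not cpu_by_node:
        return False
    aggregate = max(cpu_by_node) if use_max else mean(cpu_by_node)
    return aggregate >= threshold


# Hypothetical snapshot: one node is saturated, the others are lightly loaded.
snapshot = [0.97, 0.35, 0.30, 0.32]

avg_trigger = should_scale_out(snapshot, use_max=False)  # mean is 0.485
max_trigger = should_scale_out(snapshot, use_max=True)   # max is 0.97

print("average-based trigger fires:", avg_trigger)
print("maximum-based trigger fires:", max_trigger)
```

With these assumed readings, the average stays well below the threshold while the maximum exceeds it, so only the maximum-based trigger would start scaling.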
If your project requires exemption from the new rate limits due to legitimate high-volume traffic, please contact our support team for traffic analysis and potential exclusion setup.
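As a rough picture of how a per-project rate limit with exclusions can behave, the following sketch uses a token bucket per project and an exemption set. It is an illustration only and does not describe our production limiter: the class names, the `rate` and `burst` values, and the project IDs are hypothetical.

```python
import time


class TokenBucket:
    """Simple token bucket: refills at `rate` tokens/second up to `burst`."""

    def __init__(self, rate, burst, now=None):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ProjectRateLimiter:
    """Applies one bucket per project; exempt projects bypass the limit."""

    def __init__(self, rate, burst, exempt_projects=()):
        self.rate = rate
        self.burst = burst
        self.exempt = set(exempt_projects)
        self.buckets = {}

    def allow(self, project_id, now=None):
        if project_id in self.exempt:
            return True
        bucket = self.buckets.get(project_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, now)
            self.buckets[project_id] = bucket
        return bucket.allow(now)


# Hypothetical limits: 1 request/second with a burst of 2.
limiter = ProjectRateLimiter(rate=1, burst=2, exempt_projects={"high-volume"})
burst_results = [limiter.allow("small", now=0.0) for _ in range(3)]
after_refill = limiter.allow("small", now=1.0)
exempt_results = [limiter.allow("high-volume", now=0.0) for _ in range(100)]
```

In this sketch the non-exempt project is throttled once its burst is used up and recovers as tokens refill, while the exempt project is never throttled.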
For our enterprise customers: We will be reaching out separately regarding the transition to dedicated gateway clusters.