A single point of failure triggered the Amazon outage affecting millions - Ars Technica

It’s always DNS
Amazon said the root cause of the outage was a bug in the software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. The bug was a race condition, an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers’ control. The result can be unexpected behavior and potentially harmful failures.
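
To make the definition concrete, here is a minimal, generic sketch of a race condition in Python. It is not taken from the article or from AWS code: the workers "A" and "B", the `run` function, and the schedules are all hypothetical. Two workers each read a shared value and write back that value plus one, and the only thing that changes between the two runs is the order in which the steps happen.

```python
def run(schedule):
    """Replay read/write steps for two workers in the given order.

    Each worker reads the shared value, then later writes back
    (the value it read) + 1. The schedule decides the interleaving.
    """
    shared = {"value": 0}
    local = {}  # what each worker last read
    for worker, step in schedule:
        if step == "read":
            local[worker] = shared["value"]
        elif step == "write":
            shared["value"] = local[worker] + 1
    return shared["value"]


# Worker A finishes before worker B starts: both increments count.
sequential = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# Both workers read before either writes: one increment is lost.
interleaved = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(run(sequential))   # 2
print(run(interleaved))  # 1
```

The code is identical in both runs. Only the timing differs, which is why bugs of this kind can stay hidden until an unusual delay changes the order of events.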

In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.” While the enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans. Then, a separate DNS Enactor began to implement them.

The timing of these two enactors triggered the race condition, which ended up taking down DynamoDB entirely.
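
The excerpt does not describe the exact interleaving or how Amazon addressed it, so the following is only an illustrative sketch of one way two writers applying plans can race, and of a common guard against it. Every name here (`Plan`, `EndpointTable`, `apply_unchecked`, `apply_if_newer`) and the IP addresses are hypothetical, not AWS components or APIs. A delayed writer holding an older plan finishes after a faster writer holding a newer plan. Without a version check the older plan wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical DNS plan: a version number plus the records to serve."""
    version: int
    records: tuple


class EndpointTable:
    """Hypothetical lookup table that enactor-like writers update."""

    def __init__(self):
        self.current = None

    def apply_unchecked(self, plan):
        # Last writer wins, regardless of how old its plan is.
        self.current = plan

    def apply_if_newer(self, plan):
        # Guard: refuse any plan that is not newer than the one in place.
        if self.current is None or plan.version > self.current.version:
            self.current = plan
            return True
        return False


old_plan = Plan(version=1, records=("10.0.0.1",))
new_plan = Plan(version=2, records=("10.0.0.2",))

# Unguarded: the fast writer applies the new plan, then the delayed
# writer finishes and overwrites it with the stale one.
unguarded = EndpointTable()
unguarded.apply_unchecked(new_plan)
unguarded.apply_unchecked(old_plan)
print(unguarded.current.version)  # 1

# Guarded: the same ordering, but the stale plan is rejected.
guarded = EndpointTable()
guarded.apply_if_newer(new_plan)
accepted = guarded.apply_if_newer(old_plan)
print(guarded.current.version, accepted)  # 2 False
```

In a real concurrent system the check and the write would also need to happen atomically, for example under a lock or through a conditional write. The single-threaded sketch above sidesteps that to keep the ordering easy to follow.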

October 26, 2025 at 5:18:31 PM UTC

https://arstechnica.com/gadgets/2025/10/a-single-point-of-failure-triggered-the-amazon-outage-affecting-millions/
